Compare commits

...

2761 Commits

Author SHA1 Message Date
Botond Dénes
0ab815bc90 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions

(cherry picked from commit e8b37d1a89)

Closes scylladb/scylladb#29135

Closes scylladb/scylladb#29192

Closes scylladb/scylladb#29201
2026-03-24 21:13:53 +02:00
Pavel Emelyanov
2e132f19c9 Update seastar submodule (iotune fixes)
* seastar 5a0c5e1f8...0f879e7b6 (2):
  > Add sequential buffer size options to IOTune
  > Fix incorrect defaults for io queue iops/bandwidth

refs: CUSTOMER-242

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29073
2026-03-24 21:13:19 +02:00
Anna Stuchlik
fa59daad49 doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111

(cherry picked from commit 88b98fac3a)

Closes scylladb/scylladb#29119

Closes scylladb/scylladb#29139

Closes scylladb/scylladb#29199
2026-03-24 16:06:22 +02:00
Patryk Jędrzejczak
95e8c6fd56 test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.

Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.

This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103

Backport to all supported branches (other than 2026.1), as the test can fail
there.

Closes scylladb/scylladb#29108

(cherry picked from commit 1398a55d16)

Closes scylladb/scylladb#29206
2026-03-24 13:09:36 +01:00
Calle Wilund
42730454cb encryption::gcp_host: Add exponential retry for server errors
Fixes #27242

Similar to AWS, google services may at times simply return a 503,
more or less meaning "busy, please retry". We rely for most cases
higher up layers to handle said retry, but we cannot fully do so,
because both we reach this code sometimes through paths that do
no such thing, and also because it would be slightly inefficient,
since we'd like to for example control the back-off for auth etc.

This simply changes the existing retry loop in gcp_host to
be a little more forgiving, special case 503 errors and extend
the retry to the auth part, as well as re-use the
exponential_backoff_retry primitive.

v2:
* Avoid backoff if refreshing credentials. Should not add latency due to this.
* Only allow re-auth once per (non-service-failure-backoff) try.
* Add abort source to both request and retry
v3:
* Include timeout and other server errors in retry-backoff
v4:
* Reorder error code handling correctly

Closes scylladb/scylladb#27267

(cherry picked from commit 4169bdb7a6)

Closes scylladb/scylladb#27437
2026-03-20 11:02:15 +02:00
Jenkins Promoter
3f0c17db00 Update pgo profiles - aarch64 2026-03-15 04:52:56 +02:00
Anna Stuchlik
244ed3d184 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254

(cherry picked from commit 64b1798513)

Closes scylladb/scylladb#28888

Closes scylladb/scylladb#28906

Closes scylladb/scylladb#28922
2026-03-12 12:28:50 +01:00
Anna Stuchlik
b183fe87a1 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152

(cherry picked from commit 855c503c63)

Closes scylladb/scylladb#28910

Closes scylladb/scylladb#28927

Closes scylladb/scylladb#28974

Closes scylladb/scylladb#28989
2026-03-11 10:40:13 +02:00
Avi Kivity
36337c59de dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445

(cherry picked from commit 5d1846d783)
2026-03-10 14:05:27 +02:00
Botond Dénes
a7909a1fda Merge '[Backport 2025.1] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot]
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

- (cherry picked from commit 1c764cf6ea)

- (cherry picked from commit e4c42acd8f)

- (cherry picked from commit 9ccc95808f)

Compared to parent PR:
* drop conversion to rack-list and setting enforce_rack_list commits
* remove the rack list part of all procedures
* in case of rf_rack_valid_keyspaces - there is no way to add/remove DC - recommend to restart the cluster and turn the option off (without mentioning MVs)

Parent PR: #28521

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28777

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
2026-03-03 16:25:37 +02:00
Wojciech Mitros
793a1e9e89 mv: don't mark the view as built if the reader produced no partitions
When we build a materialized view we read the entire base table from start to
end to generate all required view udpates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view udpates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")

If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`

In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.

Fixes https://github.com/scylladb/scylladb/issues/26523

Closes scylladb/scylladb#26635

(cherry picked from commit 0a22ac3c9e)

Closes scylladb/scylladb#26880
2026-03-03 13:32:24 +02:00
Łukasz Paszkowski
94650ff482 test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
   node A is missing data, digest mismatches and data reconciliation
   is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
   failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140

(cherry picked from commit e07fe2536e)

Closes scylladb/scylladb#28827
2026-03-03 13:29:12 +02:00
Jenkins Promoter
91b8176900 Update pgo profiles - aarch64 2026-03-01 04:51:30 +02:00
Aleksandra Martyniuk
08cc2fa804 docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
(cherry picked from commit 9ccc95808f)
2026-02-26 12:11:31 +01:00
Aleksandra Martyniuk
d889f9420f docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
(cherry picked from commit e4c42acd8f)
2026-02-26 12:11:17 +01:00
Aleksandra Martyniuk
e249204f76 docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
(cherry picked from commit 1c764cf6ea)
2026-02-26 12:10:11 +01:00
Botond Dénes
312a417060 Merge '[Backport 2025.1] test: cluster: Fix test_sync_point' from Scylladb[bot]
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

- (cherry picked from commit 628e74f157)

- (cherry picked from commit ac4af5f461)

- (cherry picked from commit c5239edf2a)

- (cherry picked from commit a256ba7de0)

- (cherry picked from commit f83f911bae)

Parent PR: #28602

Closes scylladb/scylladb#28620

* github.com:scylladb/scylladb:
  test: topology_custom: Reduce wait time in test_sync_point
  test: topology_custom: Fix test_sync_point
  test: topology_custom: Await sync points asynchronously
  test: topology_custom: Create sync points asynchronously
  test: topology_custom: Fetch hint metrics asynchronously
2026-02-26 10:10:12 +02:00
Calle Wilund
51b21e11fc commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679

(cherry picked from commit ab4e4a8ac7)

Closes scylladb/scylladb#28689
2026-02-26 10:09:51 +02:00
Yaron Kaikov
59c0a6a057 .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR

Closes scylladb/scylladb#28804

(cherry picked from commit b211590bc0)

Closes scylladb/scylladb#28815
2026-02-26 10:09:25 +02:00
Dawid Mędrek
032b5f16e0 test: topology_custom: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.

(cherry picked from commit f83f911bae)
2026-02-19 15:04:09 +01:00
Dawid Mędrek
654a824f58 test: topology_custom: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

(cherry picked from commit a256ba7de0)
2026-02-19 15:04:08 +01:00
Dawid Mędrek
07758bce68 test: topology_custom: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.

(cherry picked from commit c5239edf2a)
2026-02-19 15:04:08 +01:00
Dawid Mędrek
45577072fa test: topology_custom: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.

(cherry picked from commit ac4af5f461)
2026-02-19 15:04:08 +01:00
Dawid Mędrek
58a342c524 test: topology_custom: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.

(cherry picked from commit 628e74f157)
2026-02-19 15:04:08 +01:00
Tomasz Grabiec
3e4b6369f3 Fix lambda-coroutine fiasco in hint_endpoint_manager.cc
Found by copilot.

No issue was observed yet.

Fixes #27520

Closes scylladb/scylladb#27477

(cherry picked from commit 7bc59e93b2)

Closes scylladb/scylladb#27727
2026-02-19 13:19:20 +02:00
Patryk Jędrzejczak
5db584a037 test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632

(cherry picked from commit 0693091aff)

Closes scylladb/scylladb#28670
2026-02-17 16:05:53 +01:00
Jenkins Promoter
d683369a1c Update pgo profiles - aarch64 2026-02-15 04:50:23 +02:00
Patryk Jędrzejczak
280303154d Merge '[Backport 2025.1] storage_service: set up topology properly in maintenance mode' from Scylladb[bot]
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for #27988
and to avoid similar bugs in the future.

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

- (cherry picked from commit a08c53ae4b)

- (cherry picked from commit 9d4a5ade08)

- (cherry picked from commit c92962ca45)

- (cherry picked from commit 408c6ea3ee)

- (cherry picked from commit 53f58b85b7)

- (cherry picked from commit 867a1ca346)

- (cherry picked from commit 6c547e1692)

- (cherry picked from commit 7e7b9977c5)

Parent PR: #28322

Closes scylladb/scylladb#28496

* https://github.com/scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-04 16:57:50 +01:00
Jenkins Promoter
cec22acbe1 Update ScyllaDB version to: 2025.1.12 2026-02-04 16:38:12 +02:00
Tomasz Grabiec
07f489098c Merge '[Backport 2025.1] load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Scylladb[bot]
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

- (cherry picked from commit 71be10b8d6)

- (cherry picked from commit 92dbde54a5)

Parent PR: #28440

Closes scylladb/scylladb#28468

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-02-03 22:18:27 +01:00
Patryk Jędrzejczak
3408b5ee66 test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.

(cherry picked from commit 7e7b9977c5)
2026-02-03 12:20:11 +01:00
Patryk Jędrzejczak
7c40e90529 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.

(cherry picked from commit 6c547e1692)
2026-02-03 12:20:11 +01:00
Patryk Jędrzejczak
687061cca9 test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.

(cherry picked from commit 867a1ca346)
2026-02-03 12:20:10 +01:00
Patryk Jędrzejczak
f5af15c245 test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.

(cherry picked from commit 53f58b85b7)
2026-02-03 12:20:10 +01:00
Patryk Jędrzejczak
e7837844e8 test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.

(cherry picked from commit 408c6ea3ee)
2026-02-03 12:20:10 +01:00
Patryk Jędrzejczak
77721b2f32 test: test_maintenance_mode: remove the redundant value from the query result
(cherry picked from commit c92962ca45)
2026-02-03 12:20:10 +01:00
Patryk Jędrzejczak
072af10f45 storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.

(cherry picked from commit 9d4a5ade08)
2026-02-03 12:20:10 +01:00
Patryk Jędrzejczak
54ca1688d6 storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988

(cherry picked from commit a08c53ae4b)
2026-02-03 12:20:09 +01:00
Ferenc Szili
f8ae81d8bb test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.

(cherry picked from commit 92dbde54a5)
2026-02-03 08:55:26 +01:00
Jenkins Promoter
d268b73638 Update pgo profiles - aarch64 2026-02-01 04:48:13 +02:00
Jenkins Promoter
f2258f4b97 Update pgo profiles - x86_64 2026-02-01 04:02:53 +02:00
Ferenc Szili
7ac3ae68fc load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.

(cherry picked from commit 71be10b8d6)
2026-02-01 00:32:24 +00:00
Botond Dénes
ac36cafa98 db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163

(cherry picked from commit a53f989d2f)

Closes scylladb/scylladb#28277
2026-01-30 16:22:22 +02:00
Botond Dénes
cb5621f17f Merge '[Backport 2025.1] db: batchlog_manager: update _last_replay only if all batches were re…' from Scylladb[bot]
…played

Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.

Needs backport to all live versions.

- (cherry picked from commit 4d0de1126f)

- (cherry picked from commit e3dcb7e827)

Parent PR: #26793

Closes scylladb/scylladb#27088

* github.com:scylladb/scylladb:
  test: extend test_batchlog_replay_failure_during_repair
  db: batchlog_manager: update _last_replay only if all batches were replayed
2026-01-30 16:21:43 +02:00
Aleksandra Martyniuk
fef5ae1e00 test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.

The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.

Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
  - new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.

Remove nightly mark.

Fixes: #26148.

Closes scylladb/scylladb#26209

(cherry picked from commit 48bbe09c8b)

Closes scylladb/scylladb#26217
2026-01-30 16:18:49 +02:00
Yaron Kaikov
b3baef1f2e .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374

(cherry picked from commit 3f10f44232)

Closes scylladb/scylladb#28391
2026-01-27 16:09:14 +02:00
Gleb Natapov
5e706c1790 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009

(cherry picked from commit bee5f63cb6)

Closes scylladb/scylladb#28174
2026-01-27 10:17:06 +01:00
Patryk Jędrzejczak
d360d5a937 test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274

(cherry picked from commit 1f0f694c9e)

Closes scylladb/scylladb#28302
2026-01-22 11:08:22 +01:00
Aleksandra Martyniuk
b5dae10acb test: extend test_batchlog_replay_failure_during_repair
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.

(cherry picked from commit e3dcb7e827)
2026-01-22 10:48:32 +01:00
Aleksandra Martyniuk
11cab796a7 db: batchlog_manager: update _last_replay only if all batches were replayed
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.
(cherry picked from commit 4d0de1126f)
2026-01-22 10:48:18 +01:00
Botond Dénes
abf7c34b40 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155

(cherry picked from commit b7bc48e7b7)

Closes scylladb/scylladb#28238
2026-01-21 06:47:21 +02:00
Anna Stuchlik
4bf77d4dde doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725

(cherry picked from commit 84e9b94503)

Closes scylladb/scylladb#28283
2026-01-21 06:47:05 +02:00
Asias He
a04f31da5b repair: Allow min max range to be updated for repair history
It is observed that:

repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error
(repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])

This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.

To fix, we can relax the check of (start, end] for the min max range.

Fixes #27220

Backport to all active branches.

(cherry picked from commit e97a504)
Parent PR: #27357

Closes scylladb/scylladb#27458
2026-01-19 11:09:42 +02:00
Nikos Dragazis
d21b37c8eb test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197

(cherry picked from commit 8aca7b0eb9)

Closes scylladb/scylladb#28207
2026-01-19 09:44:21 +02:00
Botond Dénes
729cf6d5b6 Merge '[Backport 2025.1] raft topology: preserve IP -> ID mapping of a replacing node on restart' from Scylladb[bot]
Note: only the fix for replace with different IP has been backported to
2025.1. The IP-based gossiper in 2025.1 has made backporting the fix
for replace with the same IP too difficult.

We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

Fixes #28057

Backport this PR to all branches as it fixes a problematic bug.

- (cherry picked from commit fc4c2df2ce)

- (cherry picked from commit 4526dd93b1)

- (cherry picked from commit 749b0278e5)

- (cherry picked from commit 0fed9f94f8)

Manually cherry-picked:
- 90b5b2c5f5
- 92b165b8c0

Parent PR: #27435

Closes scylladb/scylladb#28096

* github.com:scylladb/scylladb:
  test: introduce test_full_shutdown_during_replace
  utils: error_injection: allow aborting wait_for_message
  raft topology: preserve IP -> ID mapping of a replacing node on restart
  pylib/rest_client.py: encode injection name
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
2026-01-19 09:43:43 +02:00
Raphael S. Carvalho
ab19da2bd7 replica: Fix race between drop table and merge completion handling
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them

To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.

Fixes #25551.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#25563

(cherry picked from commit 149f9d8448)

Closes scylladb/scylladb#25630
2026-01-19 06:40:30 +02:00
Calle Wilund
74a721b9db db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998

(cherry picked from commit a7cdb602e1)

Closes scylladb/scylladb#28095
2026-01-16 16:24:43 +02:00
Michał Jadwiszczak
3d053dea4a docs/dev/service_levels: update docs to service levels on raft
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.

Fixes scylladb/scylladb#18177

Closes scylladb/scylladb#26556

(cherry picked from commit 649efd198f)

Closes scylladb/scylladb#28128
2026-01-16 14:35:57 +01:00
Piotr Dulikowski
fccbf99c71 Merge '[Backport 2025.1] service/storage_service: update service levels cache after upgrade to v2' from Scylladb[bot]
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this patch adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90)

This fix should be backported to all versions containing service levels on Raft.

[SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

- (cherry picked from commit 53d0a2b5dc)

- (cherry picked from commit be16e42cb0)

Parent PR: #27585

Closes scylladb/scylladb#28069

* github.com:scylladb/scylladb:
  service/storage_service: update service levels cache after upgrade to v2
  service/storage_service: check if service levels were already upgraded before doing migration to raft
2026-01-16 14:34:58 +01:00
Patryk Jędrzejczak
a1eec6f495 test: test_group0_schema_versioning: wait for schema sync in system.local
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.

The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.

We fix the problem by making the function wait until the schema versions
match.

Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.

Fixes #23803

Closes scylladb/scylladb#28114

(cherry picked from commit 6b5923c64e)

Closes scylladb/scylladb#28172
2026-01-16 11:23:09 +01:00
Sergey Zolotukhin
c0f730bbaf test: disable test_start_bootstrapped_with_invalid_seed
The test intermittently fails when an invalid DNS name is resolved,
likely due to ISP DNS error hijacking (see scylladb/scylladb#28153).

Disable this test to unblock CI.

Fixes scylladb/scylladb#28153

Closes scylladb/scylladb#28162

(cherry picked from commit 799d837295)
2026-01-15 17:56:10 +02:00
Avi Kivity
522f76edca Update seastar submodule (accept unbounded recursion)
* seastar c4ca171b62...5a0c5e1f87 (1):
  > net: posix_server_socket_impl: coroutinize accept(), fix unbounded recursion

Fixes #28166
2026-01-15 14:37:09 +02:00
Jenkins Promoter
b5b4ede294 Update pgo profiles - aarch64 2026-01-15 04:57:01 +02:00
Jenkins Promoter
f817255e62 Update pgo profiles - x86_64 2026-01-15 04:13:35 +02:00
Patryk Jędrzejczak
fe8738540e test: introduce test_full_shutdown_during_replace
(cherry picked from commit 749b0278e5)
2026-01-13 17:14:19 +01:00
Patryk Jędrzejczak
34fd9e55f7 utils: error_injection: allow aborting wait_for_message
The test added in the following commit utilizes it.

(cherry picked from commit 4526dd93b1)
2026-01-13 17:14:19 +01:00
Patryk Jędrzejczak
2096ac02df raft topology: preserve IP -> ID mapping of a replacing node on restart
Note: only the fix for replace with different IP has been backported to
2025.1. The IP-based gossiper in 2025.1 has made backporting the fix
for replace with the same IP too difficult.

We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

(cherry picked from commit fc4c2df2ce)
2026-01-13 17:14:19 +01:00
Petr Gusev
1b16fabbf6 pylib/rest_client.py: encode injection name
Sometimes it's convenient to use slashes in injection names,
for example my_component/my_method/my_condition. Without quote()
we get 'handler not found' error from Scylla.

(cherry picked from commit 92b165b8c0)
2026-01-13 17:14:19 +01:00
Michał Jadwiszczak
e4c7ac5a14 utils/error_injection: allow to abort injection_handler::wait_for_message()
(cherry picked from commit 90b5b2c5f5)
2026-01-13 17:14:18 +01:00
Patryk Jędrzejczak
52f20e66ef Merge '[Backport 2025.1] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot]
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).

Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

- (cherry picked from commit ec4069246d)

- (cherry picked from commit 5be6b80936)

- (cherry picked from commit 0342a24ee0)

- (cherry picked from commit 02ee341a03)

- (cherry picked from commit 2a803d2261)

- (cherry picked from commit 93b827c185)

- (cherry picked from commit ebd667a8e0)

Parent PR: #27643

Closes scylladb/scylladb#28067

* https://github.com/scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-12 11:27:01 +01:00
Tomasz Grabiec
bec19944ca test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:

```
   @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')

        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})

        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]

        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')

        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)

        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')

        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)

        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})

        cql = manager.get_cql()

        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)

>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```

Fixes #28011

Closes scylladb/scylladb#28040

(cherry picked from commit 34df158605)

Closes scylladb/scylladb#28063
2026-01-09 19:14:34 +01:00
Botond Dénes
f9be7b4672 reader_concurrency_semaphore: add protection against negative count resource leaks
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.

I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.

Fixes: SCYLLADB-163

Closes scylladb/scylladb#27764

(cherry picked from commit e4da0afb8d)

Closes scylladb/scylladb#28000
2026-01-09 13:28:46 +02:00
Benny Halevy
7389784151 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
2026-01-09 09:02:29 +02:00
Benny Halevy
09e49cd711 database: truncate_table_on_all_shards: drop outdated TODO comment
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 93b827c185)
2026-01-09 09:01:57 +02:00
Benny Halevy
65f0c21e3f database: truncate_table_on_all_shards: consider can_flush on all shards
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.

Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().

Fixes #27639

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2a803d2261)
2026-01-09 09:01:24 +02:00
Benny Halevy
750fd1bec3 memtable_list: unify can_flush and may_flush
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 02ee341a03)
2026-01-09 08:58:15 +02:00
Benny Halevy
3920cacadc test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
2026-01-09 08:57:39 +02:00
Benny Halevy
7496fb5b81 replica: table, storage_group, compaction_group: add needs_flush
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5be6b80936)
2026-01-09 08:54:36 +02:00
Benny Halevy
faecb2aabc test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
2026-01-09 08:53:16 +02:00
Michał Jadwiszczak
ffae4a9698 service/storage_service: update service levels cache after upgrade to v2
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes SCYLLADB-90

(cherry picked from commit be16e42cb0)
2026-01-08 22:45:21 +00:00
Michał Jadwiszczak
bb55c3beb8 service/storage_service: check if service levels were already upgraded
before doing migration to raft

There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.

(cherry picked from commit 53d0a2b5dc)
2026-01-08 22:45:21 +00:00
Anna Stuchlik
579e66a540 doc: remove cassandra-stress from installation instructions
The cassandra-stress tool is no longer part of the default package
and cannot be run in the way described.

This commit removes the instruction to run cassandra-stress.

Fixes https://github.com/scylladb/scylladb/issues/24994

Closes scylladb/scylladb#27726

(cherry picked from commit 624869de86)

Closes scylladb/scylladb#27948
2026-01-08 18:12:43 +02:00
Botond Dénes
a6367cc48c Merge '[Backport 2025.1] topology_coordinator: Add barrier to cleanup_target' from Scylladb[bot]
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512

It's a pre existing issue. Backport is required to all recent 2025.x versions.

- (cherry picked from commit 669286b1d6)

- (cherry picked from commit 67f1c6d36c)

- (cherry picked from commit 6163fedd2e)

Parent PR: #27413

Closes scylladb/scylladb#27425

* github.com:scylladb/scylladb:
  topology_coordinator: Fix the indentation for the cleanup_target case
  topology_coordinator: Add barrier to cleanup_target
  test_node_failure_during_tablet_migration: Increase RF from 2 to 3
2026-01-08 18:09:37 +02:00
Geoff Montee
3aa62c361e Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures
Fixes #27077

Multiple points can be clarified relating to:

* Names of each sub-procedure could be clearer
* Requirements of each sub-procedure could be clearer
* Clarify which keyspaces are relevant and how to check them
* Fix typos in keyspace name

Closes scylladb/scylladb#26855

(cherry picked from commit a0734b8605)

Closes scylladb/scylladb#27148
2026-01-08 18:08:47 +02:00
Botond Dénes
0194ba736a Merge '[Backport 2025.1] api: storage_service: tasks: unify sync and async compaction APIs' from Scylladb[bot]
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.

Fixes: https://github.com/scylladb/scylladb/issues/26715.

Requires backports to all live versions

- (cherry picked from commit 12dabdec66)

- (cherry picked from commit 044b001bb4)

- (cherry picked from commit fdd623e6bc)

Parent PR: #26746

Closes scylladb/scylladb#26878

* github.com:scylladb/scylladb:
  api: storage_service: tasks: unify upgrade_sstable
  api: storage_service: tasks: force_keyspace_cleanup
  api: storage_service: tasks: unify force_keyspace_compaction
2026-01-08 18:07:30 +02:00
Botond Dénes
9b10a6328d Merge '[Backport 2025.1] db: repair: do not update repair_time if batchlog replay failed' from Scylladb[bot]
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog reply fails;
    - repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

- (cherry picked from commit 502b03dbc6)

- (cherry picked from commit 904183734f)

- (cherry picked from commit 7f20b66eff)

- (cherry picked from commit e1b2180092)

- (cherry picked from commit d436233209)

- (cherry picked from commit 1935268a87)

- (cherry picked from commit 6fc43f27d0)

Parent PR: #26319

Closes scylladb/scylladb#26752

* github.com:scylladb/scylladb:
  repair: throw if flush failed in get_flush_time
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2026-01-08 18:07:03 +02:00
Patryk Jędrzejczak
310ce822c9 Merge '[Backport 2025.1] test/raft: use valid sentinel in liveness check to prevent digest errors' from Scylladb[bot]
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.

The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).

- (cherry picked from commit 3af5183633)

- (cherry picked from commit 4ba3e90f33)

Parent PR: #28010

Closes scylladb/scylladb#28036

* https://github.com/scylladb/scylladb:
  test/raft: use valid sentinel in liveness check to prevent digest errors
  test/raft: improve debugging in randomized_nemesis_test
  test/raft: improve reporting in the randomized_nemesis_test digest functions
2026-01-08 15:42:08 +01:00
Aleksandra Martyniuk
e1776dc828 test: rename duplicate tests
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.

Rename the tests so that they are unique.

Fixes: https://github.com/scylladb/scylladb/issues/27701.

Closes scylladb/scylladb#27720

(cherry picked from commit bbe64e0e2a)

Closes scylladb/scylladb#27845
2026-01-08 14:52:38 +02:00
Emil Maskovsky
9945ade82d test/raft: use valid sentinel in liveness check to prevent digest errors
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.

The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

(cherry picked from commit 4ba3e90f33)
2026-01-08 10:47:21 +01:00
Emil Maskovsky
3237ad1e08 test/raft: improve debugging in randomized_nemesis_test
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.

Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.

(cherry picked from commit 3af5183633)
2026-01-08 10:47:21 +01:00
Emil Maskovsky
0febd6719a test/raft: improve reporting in the randomized_nemesis_test digest functions
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.

Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.

This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.

Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030

(cherry picked from commit d60b908a8e)
2026-01-08 10:46:58 +01:00
Anna Stuchlik
8bc7efc42f doc: fix the syntax of internal links
Some internal links had the wrong syntax: they were formatted as external links.
As a result, they redirected the user to the outdated Open Source documentation.
This commit fixes that bug.

Fixes https://github.com/scylladb/scylladb/issues/25899

Closes scylladb/scylladb#27905

(cherry picked from commit 375479d96c)

Closes scylladb/scylladb#27999
2026-01-07 18:59:48 +02:00
Jenkins Promoter
904e2526dd Update pgo profiles - aarch64 2026-01-01 04:57:41 +02:00
Jenkins Promoter
e88b8ce981 Update pgo profiles - x86_64 2025-12-31 21:15:25 -05:00
Gleb Natapov
30d1d83ea1 topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.

Closes scylladb/scylladb#27757

(cherry picked from commit 04976875cc)

Closes scylladb/scylladb#27864
2025-12-28 13:34:37 +02:00
Ferenc Szili
7a646e101c test: fix flakyness caused by TRUNCATE retries
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.

When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.

In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.

Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
  bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
  fails the test

This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.

Fixes: #26347

Closes scylladb/scylladb#27245

(cherry picked from commit d883ff2317)

Closes scylladb/scylladb#27503
2025-12-23 17:09:37 +02:00
Yaron Kaikov
2df9f878c7 auto-backport.py: modify instruction for making PR ready for review
Update the comment sent when PR has conflicts with clear instrauctions how to make the PR Ready for review

Fixes: https://scylladb.atlassian.net/browse/RELENG-152

Closes scylladb/scylladb#27547

(cherry picked from commit d3e199984e)

Closes scylladb/scylladb#27562
2025-12-22 15:20:44 +02:00
Emil Maskovsky
00a6671543 test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737

(cherry picked from commit 2a75b1374e)

Closes scylladb/scylladb#27778
2025-12-21 19:26:45 +02:00
Anna Stuchlik
b1d2b188e3 doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756

(cherry picked from commit f65db4e8eb)

Closes scylladb/scylladb#27779
2025-12-21 19:21:52 +02:00
Patryk Jędrzejczak
18402fd300 Merge '[Backport 2025.1] topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted' from Scylladb[bot]
In several exception handlers, only `raft::request_aborted` was being caught and rethrown, while `seastar::abort_requested_exception` was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation"

This change adds `seastar::abort_requested_exception` handling alongside `raft::request_aborted` in all places where it was missing. When rethrown, these exceptions propagate up to the main `run()` loop where `handle_topology_coordinator_error()` recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations.

Fixes: scylladb/scylladb#27255

No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch.

- (cherry picked from commit 37e3dacf33)

Parent PR: #27314

Closes scylladb/scylladb#27660

* https://github.com/scylladb/scylladb:
  topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
  topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands
2025-12-20 19:37:17 +01:00
Michael Litvak
86e112d7c6 view_builder: reduce log level for expected aborts during view creation
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.

Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.

Closes scylladb/scylladb#26297

(cherry picked from commit 6bc41926e2)

Closes scylladb/scylladb#27536
2025-12-19 17:02:02 +01:00
Emil Maskovsky
66765c6bd3 topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"

This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.

Fixes: scylladb/scylladb#27255

(cherry picked from commit 37e3dacf33)
2025-12-19 16:27:16 +01:00
Emil Maskovsky
2d57dc32a3 topology_coordinator: consistently rethrow raft::request_aborted for direct/global commands
Ensure all direct and global topology commands rethrow the
`raft::request_aborted` exception when aborted, typically due to
leadership changes. This makes abortion explicit to callers, enabling
proper handling such as retries or workflow termination.

This change completes the work started in PR scylladb/scylladb#23962,
covering all remaining cases where the exception was not rethrown.

Fixes: scylladb/scylladb#23589

(cherry picked from commit 943af1ef1c)
2025-12-19 16:26:56 +01:00
Aleksandra Martyniuk
5385b799ba repair: throw if flush failed in get_flush_time
Currently, _flush_time was stored as a std::optional<gc_clock::time_point>
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected since the _flush_time is initialized
with value (not optional).

Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.

This was suppose to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.

Refs: https://github.com/scylladb/scylladb/issues/24415.

Closes scylladb/scylladb#26794

(cherry picked from commit e3e81a9a7a)
2025-12-17 17:08:14 +01:00
Aleksandra Martyniuk
f8bebd2455 db: fix indentation
(cherry picked from commit 6fc43f27d0)
2025-12-16 15:57:29 +01:00
Aleksandra Martyniuk
a6c72c5020 test: add reproducer for data resurrection
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.

If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.

(cherry picked from commit 1935268a87)
2025-12-16 15:55:25 +01:00
Aleksandra Martyniuk
0a7e9857c6 repair: fail tablet repair if any batch wasn't sent successfully
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.

If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.

(cherry picked from commit d436233209)
2025-12-16 15:55:25 +01:00
Aleksandra Martyniuk
8e6a709c7f db/batchlog_manager: fix making decision to skip batch replay
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.

To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
    now() < written_at + min(timeout + propagation_delay)

repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
    repair_time <= now()

So we know that:
    repair_time < written_at + propagation_delay

With this condition we are sure that GC won't happen.

(cherry picked from commit e1b2180092)
2025-12-16 15:55:25 +01:00
Aleksandra Martyniuk
9422baf49f db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.

(cherry picked from commit 7f20b66eff)
2025-12-16 15:55:25 +01:00
Aleksandra Martyniuk
ee0fb13a84 db/batchlog_manager: delete batch with incorrect or unknown version
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.

Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.

(cherry picked from commit 904183734f)
2025-12-16 15:55:25 +01:00
Aleksandra Martyniuk
9554b4ef28 db/batchlog_manager: coroutinize replay_all_failed_batches
(cherry picked from commit 502b03dbc6)
2025-12-16 15:55:25 +01:00
Jenkins Promoter
bc863d96fe Update pgo profiles - aarch64 2025-12-15 04:45:57 +02:00
Benny Halevy
9ac657aa20 utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532

(cherry picked from commit 5f13880a91)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27579
2025-12-12 14:21:24 +01:00
Anna Stuchlik
58b490869f replace the Driver pages with a link to the new Drivers pages
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277

(cherry picked from commit c5580399a8)

Closes scylladb/scylladb#27436
2025-12-12 10:35:42 +01:00
Yaron Kaikov
71d82e4271 Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572

(cherry picked from commit 3dfa5ebd7f)

Closes scylladb/scylladb#27594
2025-12-12 09:38:08 +02:00
Avi Kivity
dfb5d1c776 tools: toolchain: prepare: replace 'reg' with 'skopeo'
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.

Fixes #27178.

Closes scylladb/scylladb#27179

(cherry picked from commit d6ef5967ef)

Closes scylladb/scylladb#27193
2025-12-07 13:15:12 +02:00
Łukasz Paszkowski
e81c7f55f9 topology_coordinator: Fix the indentation for the cleanup_target case 2025-12-04 13:33:28 +01:00
Łukasz Paszkowski
a9b06865f1 topology_coordinator: Add barrier to cleanup_target
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512
2025-12-04 13:33:14 +01:00
Łukasz Paszkowski
7f8f5ba24e test_node_failure_during_tablet_migration: Increase RF from 2 to 3
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.

To address it, the third rack with a single node is added and the
replication factor is increased to 3.
2025-12-04 13:29:12 +01:00
Calle Wilund
7defa0b4cd commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27336
2025-12-03 12:21:13 +03:00
Pavel Emelyanov
388627365f Merge '[Backport 2025.1] tablet: scheduler: Do not emit conflicting migration in merge colocation' from Scylladb[bot]
The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations.

Fixes scylladb/scylladb#27304

backport to existing releases - this is a bug that can affect correctness

- (cherry picked from commit 97b7c03709)

Parent PR: #27312

Closes scylladb/scylladb#27329

* github.com:scylladb/scylladb:
  tablet: scheduler: Do not emit conflicting migration in merge colocation
  tablet: scheduler: Do not emit conflicting migrations in the plan
2025-12-03 12:20:39 +03:00
Aleksandra Martyniuk
5297084bd1 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there is a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165

(cherry picked from commit 19a7d8e248)

Closes scylladb/scylladb#27192
2025-12-03 12:19:12 +03:00
Ernest Zaslavsky
6cbf09dae1 streaming:: add more logging
Start logging all missed streaming options like `scope` and `primary_replica` flags

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311

(cherry picked from commit 1d5f60baac)

Closes scylladb/scylladb#27335
2025-12-02 12:21:01 +01:00
Jenkins Promoter
641a573757 Update pgo profiles - aarch64 2025-12-01 04:49:52 +02:00
Michael Litvak
e3f5924f71 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312

(cherry picked from commit 97b7c03709)
2025-11-30 10:37:58 +01:00
Tomasz Grabiec
dc1a318971 tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, like we saw in #25912.

So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.

Fixes #26038

Closes scylladb/scylladb#26048

(cherry picked from commit 981592bca5)
2025-11-30 10:37:58 +01:00
Patryk Jędrzejczak
edbd2e80b2 Merge '[Backport 2025.1] fix notification about expiring erm held for to long' from Scylladb[bot]
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

- (cherry picked from commit 9f97c376f1)

- (cherry picked from commit 5dcdaa6f66)

Parent PR: #27140

Closes scylladb/scylladb#27273

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-27 12:35:18 +01:00
Patryk Jędrzejczak
d735fe2ecc Merge '[Backport 2025.1] locator/node: include _excluded in missing places' from Scylladb[bot]
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

- (cherry picked from commit 4160ae94c1)

- (cherry picked from commit 287c9eea65)

Parent PR: #27265

Closes scylladb/scylladb#27288

* https://github.com/scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-27 12:32:38 +01:00
Jenkins Promoter
4b95b6ac21 Update ScyllaDB version to: 2025.1.11 2025-11-27 07:19:36 +02:00
Patryk Jędrzejczak
e3af767196 locator/node: include _excluded in verbose formatter
It can be helpful during debugging.

(cherry picked from commit 287c9eea65)
2025-11-26 23:03:36 +00:00
Patryk Jędrzejczak
b779f15cdb locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.

(cherry picked from commit 4160ae94c1)
2025-11-26 23:03:36 +00:00
Gleb Natapov
01c71681a8 test: test that expired erm that held for too long triggers notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:07:28 +00:00
Gleb Natapov
58f927fd8e token_metadata: fix notification about expiring erm held for to long
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.

(cherry picked from commit 9f97c376f1)
2025-11-26 15:07:28 +00:00
Ernest Zaslavsky
d10f02e49b streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.

(cherry picked from commit dedc8bdf71)

Closes scylladb/scylladb#27146
Fixes: #26979

Parent PR: #26980
Unfortunatelly the pytest based test cannot be ported back because of changes made to the testing harness and scylla-tools
2025-11-25 11:56:38 +03:00
Pavel Emelyanov
14fd0d9c21 lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26754
2025-11-24 17:19:42 +03:00
Pavel Emelyanov
4f0ea75f51 Merge '[Backport 2025.1] Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho
Load-and-stream is broken when running concurrently to the finalization step of tablet split.

Consider this:

1. split starts
2. split finalization executes barrier and succeed
3. load-and-stream runs now, starts writing sstable (pre-split)
4. split finalization publishes changes to tablet metadata
5. load-and-stream finishes writing sstable
6. sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

load-and-stream awaits for topology to quiesce
perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

(cherry picked from commit 3abc66da5a)

(cherry picked from commit 4654cdc6fd)

Parent PR: https://github.com/scylladb/scylladb/pull/26456

Closes scylladb/scylladb#27126

* https://github.com/scylladb/scylladb:
  sstables_loader: Don't bypass synchronization with busy topology
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
  sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables
2025-11-24 17:17:53 +03:00
Raphael S. Carvalho
90e6e88f69 sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730

(cherry picked from commit 7f34366b9d)
(cherry picked from commit 4c466ace4f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-21 10:39:53 -03:00
Raphael S. Carvalho
d2bddea515 test: Add reproducer for l-a-s and split synchronization issue
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 4654cdc6fd)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-21 10:39:53 -03:00
Raphael S. Carvalho
e6dd6b462e sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes #26455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-21 10:39:53 -03:00
Raphael S. Carvalho
451ae85b38 sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables
The incremental reader selector maintains an unordered_set of
sstables that are already engaged, and uses std::views::filter
to filter those out. It adds the sstable under consideration to the
set, and if addition failed (because it's already in) then it
filters it out.

This breaks if the filter view is executed twice - the first pass
will add every sstable to the set, and the second will consider
every sstable already filtered. This is what happens with
libstdc++ 15 (due to the addition of vector(from_range_t) constructor),
which uses the first pass to calculate the vector size
and the second pass to insert the elements into a correctly-sized
vector.

Fix by open-coding the loop.

Closes scylladb/scylladb#23597

(cherry picked from commit ac3d25eb44)

Fixes #26247

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-21 10:38:59 -03:00
Botond Dénes
cb1f72dc81 Merge '[Backport 2025.1] Automatic cleanup improvements' from Scylladb[bot]
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported version since automatic cleanup behaviour  as it is now may create unexpected by the operator load during cluster resizing.

- (cherry picked from commit e872f9cb4e)

- (cherry picked from commit 0f0ab11311)

Parent PR: #26868

Closes scylladb/scylladb#27089

* github.com:scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command  to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-21 13:52:08 +02:00
Patryk Jędrzejczak
5ba523fa77 test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075

(cherry picked from commit e35ba974ce)

Closes scylladb/scylladb#27107
2025-11-20 10:55:12 +02:00
Botond Dénes
cf39bd8f3e Merge '[Backport 2025.1] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot]
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

- (cherry picked from commit c0f7622d12)

- (cherry picked from commit 222eab45f8)

- (cherry picked from commit 394207fd69)

- (cherry picked from commit b357c8278f)

Parent PR: #26856

Closes scylladb/scylladb#27029

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-20 10:47:06 +02:00
Botond Dénes
a84b331b09 Merge '[Backport 2025.1] cdc: set column drop timestamp in the future' from Scylladb[bot]
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes https://github.com/scylladb/scylladb/issues/26340

the issue affects all previous releases, backport to improve stability

- (cherry picked from commit eefae4cc4e)

- (cherry picked from commit 48298e38ab)

- (cherry picked from commit 039323d889)

- (cherry picked from commit e85051068d)

Parent PR: #26533

Closes scylladb/scylladb#27025

* github.com:scylladb/scylladb:
  test: test concurrent writes with column drop with cdc preimage
  cdc: check if recreating a column too soon
  cdc: set column drop timestamp in the future
2025-11-20 10:46:24 +02:00
Gleb Natapov
1368f48221 cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.

(cherry picked from commit 0f0ab11311)
2025-11-18 16:01:27 +02:00
Gleb Natapov
c6d443869a cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.

(cherry picked from commit e872f9cb4e)
2025-11-18 15:46:39 +02:00
Yaron Kaikov
a01c2dc7e4 install-dependencies.sh: update node_exporter to 1.10.2
Update node exporter to solve CVE-2025-22871

[regenerate frozen toolchain with optimized clang from
	https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-aarch64.tar.gz
	https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-x86_64.tar.gz
]
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5

Closes scylladb/scylladb#26916

(cherry picked from commit c601371b57)

Closes scylladb/scylladb#26950
2025-11-16 16:14:19 +02:00
Yaron Kaikov
1a224e7d05 auto-backport: Add support for JIRA issue references
- Added support for JIRA issue references in PR body and commit messages
- Supports both short format (PKG-92) and full URL format
- Maintains existing GitHub issue reference support
- JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID}
- Allows backporting for PRs that reference JIRA issues with 'fixes' keyword

Fixes: https://github.com/scylladb/scylladb/issues/26955

Closes scylladb/scylladb#26954

(cherry picked from commit 3ade3d8f5b)

Closes scylladb/scylladb#26963
2025-11-16 16:14:03 +02:00
Benny Halevy
4dcb8c19bd scylla-sstable: correctly dump sharding_metadata
This patch fixes 2 issues at one go:

First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.

Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991

(cherry picked from commit f9ce98384a)

Closes scylladb/scylladb#27030
2025-11-16 16:05:14 +02:00
Michael Litvak
c8466afe74 test: test concurrent writes with column drop with cdc preimage
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.

the test reproduces issue #26340 where this results in a malformed
sstable.

(cherry picked from commit e85051068d)
2025-11-16 10:19:08 +01:00
Michael Litvak
903afafa5f cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.

(cherry picked from commit 039323d889)
2025-11-16 10:19:08 +01:00
Michael Litvak
1d6538cd30 cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes scylladb/scylladb#26340

(cherry picked from commit 48298e38ab)
2025-11-16 10:19:01 +01:00
Dawid Mędrek
2313aa5856 test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.

(cherry picked from commit b357c8278f)
2025-11-15 22:09:14 +00:00
Dawid Mędrek
b217a5e43a test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.

(cherry picked from commit 394207fd69)
2025-11-15 22:09:14 +00:00
Dawid Mędrek
7808d85ecb test: Fix keyspace in test_maintenance_mode.py
The keyspace used in the test is not necessarily called `ks`.

(cherry picked from commit 222eab45f8)
2025-11-15 22:09:14 +00:00
Dawid Mędrek
f47b0743d0 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816

(cherry picked from commit c0f7622d12)
2025-11-15 22:09:13 +00:00
Jenkins Promoter
3818e15d91 Update pgo profiles - aarch64 2025-11-15 05:02:53 +02:00
Jenkins Promoter
a945742c2a Update pgo profiles - x86_64 2025-11-15 04:02:16 +02:00
Aleksandra Martyniuk
fbe6007a62 api: storage_service: tasks: unify upgrade_sstable
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.

(cherry picked from commit fdd623e6bc)
2025-11-14 16:54:53 +01:00
Aleksandra Martyniuk
1dd579d0d5 api: storage_service: tasks: force_keyspace_cleanup
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.

(cherry picked from commit 044b001bb4)
2025-11-14 16:54:52 +01:00
Aleksandra Martyniuk
ee5bec0fce api: storage_service: tasks: unify force_keyspace_compaction
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).

Unify the handlers of both apis.

(cherry picked from commit 12dabdec66)
2025-11-14 16:41:45 +01:00
Ernest Zaslavsky
e185740c54 minio: update CLI usage, remove deprecated mc options
Replace phased-out `mc` command options with supported alternatives.
Ensures compatibility with the latest MinIO version.

Closes scylladb/scylladb#24363

(cherry picked from commit 1446f57635)

Closes scylladb/scylladb#27004
2025-11-14 10:48:00 +02:00
Andrei Chekun
88556e6c77 test.py: rewrite the wait_for_first_completed
Rewrite wait_for first_completed to return only first completed task guarantee
of awaiting(disappearing) all cancelled and finished tasks
Use wait_for_first_completed to avoid false pass tests in the future and issues
like #26148
Use gather_safely to await tasks and removing warning that coroutine was
not awaited

Closes scylladb/scylladb#26435

(cherry picked from commit 24d17c3ce5)

Closes scylladb/scylladb#26661
2025-11-12 11:50:51 +01:00
Botond Dénes
bec413a671 service/storage_proxy: send batches with CL=EACH_QUORUM
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.

Fixes: scylladb/scylladb#25432

Closes scylladb/scylladb#26304

(cherry picked from commit d9c3772e20)

Closes scylladb/scylladb#26927
2025-11-11 10:23:59 +03:00
Jenkins Promoter
01e929805a Update pgo profiles - aarch64 2025-11-01 05:04:05 +02:00
Jenkins Promoter
9f961d67d7 Update pgo profiles - x86_64 2025-11-01 04:31:27 +02:00
Anna Stuchlik
ee35b5aa90 doc: add --list-active-releases to Web Installer
Fixes https://github.com/scylladb/scylladb/issues/26688

V2 of https://github.com/scylladb/scylladb/pull/26687

Closes scylladb/scylladb#26689

(cherry picked from commit bd5b966208)

Closes scylladb/scylladb#26751
2025-10-29 11:42:03 +02:00
Patryk Jędrzejczak
0cd118f5d6 test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638

(cherry picked from commit 5321720853)

Closes scylladb/scylladb#26749
2025-10-29 11:41:23 +02:00
Anna Stuchlik
2019ef899f doc: add support for Debian 12
Fixes https://github.com/scylladb/scylladb/issues/26640

Closes scylladb/scylladb#26668

(cherry picked from commit 9c0ff7c46b)

Closes scylladb/scylladb#26676
2025-10-29 11:40:50 +02:00
Botond Dénes
2e375ade3a Merge '[Backport 2025.1] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot]
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

- (cherry picked from commit a1161c156f)

- (cherry picked from commit 3c3dd4cf9d)

- (cherry picked from commit ad1a5b7e42)

Parent PR: #26180

Closes scylladb/scylladb#26475

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-29 11:40:17 +02:00
Asias He
7e82e3b56c repair: Always reset node ops progress to 100% upon completion
Always set the node ops progress to 100% when the operation finishes,
regardless of success or failure. This ensures the progress never
remains below 100%, which would otherwise indicates a pending node
operation in case of an error.

Fixes #26193

Closes scylladb/scylladb#26194

(cherry picked from commit b31e651657)

Closes scylladb/scylladb#26262
2025-10-29 11:39:39 +02:00
Andrei Chekun
c85a0615e0 test.py: fix flaky LDAP tests
The issue with current approach is that LDAP server starting on
localhost where ports can be busy. This PR migrate using HostRegistry()
instead of localhost where no busy ports.
This fix has the same idea that was on master #23235. Simple backport is
not possible due to huge differences between the branches.
Additionally, Minio's host fixed as well, to avoid flakiness.

Fixes: #26295

Closes scylladb/scylladb#26518
2025-10-28 13:57:54 +03:00
Piotr Dulikowski
5cb0dc3f2b Merge '[Backport 2025.1] transport: call update_scheduling_group for non-auth connections' from Andrzej Jackowski
This is backport of fix for #26040 and related test (#26589) to 2025.1.

Before this change, unauthorized connections stayed in `main`
scheduling group. It is not ideal, in such case, rather `sl:default`
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.

This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.

Fixes: scylladb/scylladb#26040
Fixes: scylladb/scylladb#26581

(cherry picked from commit 278019c328)
(cherry picked from commit 8642629e8e)

No backport, as it's already a backport (but similar PRs will be created for 2025.2, 2025.3, and 2025.4)

Closes scylladb/scylladb#26718

* github.com:scylladb/scylladb:
  test: add test_anonymous_user to test_raft_service_levels
  transport: call update_scheduling_group for non-auth connections
2025-10-27 22:21:24 +01:00
Andrzej Jackowski
014ecbb2e0 test: add test_anonymous_user to test_raft_service_levels
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.

Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.

Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.

This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.

Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581

Closes scylladb/scylladb#26589

(cherry picked from commit 8642629e8e)
2025-10-27 10:10:51 +01:00
Andrzej Jackowski
9ddf7feabd transport: call update_scheduling_group for non-auth connections
Before this change, unauthorized connections stayed in `main`
scheduling group. It is not ideal, in such case, rather `sl:default`
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.

This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.

Fixes: scylladb/scylladb#26040
Fixes: scylladb/scylladb#26581
(cherry picked from commit 278019c328)
2025-10-27 10:10:12 +01:00
Asias He
691f2740b2 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547

(cherry picked from commit 33bc1669c4)

Closes scylladb/scylladb#26627
2025-10-22 11:28:10 +03:00
Jenkins Promoter
c4a775be6f Update ScyllaDB version to: 2025.1.10 2025-10-15 11:29:08 +03:00
Jenkins Promoter
17cf9270b3 Update pgo profiles - aarch64 2025-10-15 05:03:59 +03:00
Jenkins Promoter
779a5b0919 Update pgo profiles - x86_64 2025-10-15 04:32:12 +03:00
Michael Litvak
0a881e8c24 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290

(cherry picked from commit ad1a5b7e42)
2025-10-09 12:46:53 +00:00
Michael Litvak
28d45bf612 auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.

(cherry picked from commit 3c3dd4cf9d)
2025-10-09 12:46:53 +00:00
Michael Litvak
874f336b3e auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.

(cherry picked from commit a1161c156f)
2025-10-09 12:46:53 +00:00
Jenkins Promoter
6c539463bb Update ScyllaDB version to: 2025.1.9 2025-10-09 12:26:37 +03:00
Avi Kivity
67c4d980dd Revert "Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski"
This reverts commit 1fd82d32e0. It causes
connection storms to snowball into a node crash via this mechanism:

1. large node suffers mild connection storm
2. password hash requests queue up on alien hash thread
3. incoming hash requests queue faster than the alien thread can retire them.
4. auth latency grows without bounds
5. this encourages the clients to create new connections
6. problem grows

Reverting the patch restores the hash stall, but at least prevents node
crashes.

Fixes #26461 (2025.1)

Closes scylladb/scylladb#26462
2025-10-09 11:04:34 +03:00
Botond Dénes
9dd32a02af Merge '[Backport 2025.1] tools: fix documentation links after change to source-available' from Scylladb[bot]
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

Broken documentation link fix for the  tool help output, needs backport to all live source-available versions.

- (cherry picked from commit 5a69838d06)

- (cherry picked from commit 15a4a9936b)

- (cherry picked from commit fe73c90df9)

Parent PR: #26322

Closes scylladb/scylladb#26386

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-08 06:38:50 +03:00
Botond Dénes
694fb53aad tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.

(cherry picked from commit fe73c90df9)
2025-10-07 10:22:29 +03:00
Botond Dénes
b6a458f9a9 Merge '[Backport 2025.1] scylla-gdb: Fix fair-queue entry printing' from Scylladb[bot]
Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer.

Fixes #26184

The issue is present in 2025.x as well and looks cheap to backport

- (cherry picked from commit 8438c59ad3)

Parent PR: #26185

Also includes backport of #24835 which also applies to 2025.1 and is now crucial.
The scylla_io_queues.ticket() method is renamed by this backport, but without 24835 it will be problematic to fix all callers of it

Closes scylladb/scylladb#26260

* github.com:scylladb/scylladb:
  scylla-gdb: Fix fair-queue entry printing
  scylla-gdb: Don't show io_queue executing and queued resources
2025-10-06 17:05:35 +03:00
Benny Halevy
10b473ffcb test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode
As the test hits timeouts in debug mode on aarch64.

Fixes #26252

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26303

(cherry picked from commit b81c6a339b)

Closes scylladb/scylladb#26323
2025-10-06 17:05:00 +03:00
Botond Dénes
b37ddaee90 Merge '[Backport 2025.1] compaction: ensure that all compaction executors are stopped' from Scylladb[bot]
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger  on_fatal_internal_error.

Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.

Check module abort source before creating compaction executor and
adding it to _tasks.

Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
  compaction to finish.

Fixes: https://github.com/scylladb/scylladb/issues/25806.

Fixes shutdown bug; Needs backports to all version

- (cherry picked from commit 17707d0e6b)

- (cherry picked from commit 97c77d7cd5)

Parent PR: #25885

Closes scylladb/scylladb#26222

* github.com:scylladb/scylladb:
  compaction: move _tasks check
  compaction: stop compaction module in really_do_stop
2025-10-06 07:04:20 +03:00
Botond Dénes
ee792f7257 release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.

(cherry picked from commit 15a4a9936b)
2025-10-03 14:27:27 +00:00
Botond Dénes
4f01659eda tools/scylla-nodetool: remove trailing " from doc urls
They are accidental leftover from a previous way of storing command
descriptions.

(cherry picked from commit 5a69838d06)
2025-10-03 14:27:27 +00:00
Piotr Smaron
54bfbb6303 [2025.1] test/ldap: assign non-busy ports to ldap
It may happen that the ports we randomly choose for LDAP are busy, and
that'd fail the test suite, so once we randomly select ports, now we'll
see if they're busy or not, and if they're busy, we'll select next ones,
until we finally have some free ports for LDAP.
Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`:
before the fix, this command fails after ~112 runs, and of course it
passes with the fix.

This is a backport of https://github.com/scylladb/scylladb/pull/23275,
but not 1:1, because the patch no longer applies, since the originally
modified files no longer exist on this branch.

Fixes: scylladb/scylla-enterprise#5120
Fixes: #23149
Fixes: #23242

Fixes: scylladb/scylladb#26295

(cherry picked from commit d365d9b2ad)

Closes scylladb/scylladb#26310
2025-10-02 06:10:26 +03:00
Jenkins Promoter
ac33223701 Update pgo profiles - aarch64 2025-10-01 04:58:10 +03:00
Jenkins Promoter
eef3cc2baf Update pgo profiles - x86_64 2025-10-01 04:35:21 +03:00
Tomasz Grabiec
e71c8a40a9 Merge '[Backport 2025.1] replica: Fix split compaction when tablet boundaries change' from Scylladb[bot]
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.

To fix this, let's make split compaction always work with the state when it started, not a global state.

Fixes #24153.

All 2025.* versions are vulnerable, so fix must be backported to them.

- (cherry picked from commit 0c1587473c)

- (cherry picked from commit 68f23d54d8)

Parent PR: #25690

Closes scylladb/scylladb#25933

* github.com:scylladb/scylladb:
  replica: Fix split compaction when tablet boundaries change
  replica: Futurize split_compaction_options()
  test: fix flakiness of test_missing_data
2025-09-30 19:45:45 +02:00
Pavel Emelyanov
75918dda8d scylla-gdb: Fix fair-queue entry printing
Catching a live entry in IO queue is very rare event, so we haven't seen
it so far, but the `_ticket` member had been removed ~2 years ago and
had been replaced with `_capacity` which is plain 64bit integer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26185

(cherry picked from commit 8438c59ad3)
2025-09-30 11:23:38 +03:00
Pavel Emelyanov
67d0f0b754 scylla-gdb: Don't show io_queue executing and queued resources
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24835
2025-09-30 11:23:38 +03:00
Gleb Natapov
1787e9aaa3 storage_service: change node_ops_info::ignore_nodes to host id
It drop useless translation from id to ip during removenode through
topology coordinator.

Closes scylladb/scylladb#25958

(cherry picked from commit d3badf7406)

Closes scylladb/scylladb#26228
2025-09-26 10:59:15 +02:00
Aleksandra Martyniuk
f322369f07 compaction: move _tasks check
In compaction_manager::really_do_stop we check whether _tasks list
is empty after the compactions are stopped. However, a new task may
still sneak in, causing the assertion failure. Such a task won't
be there for long - module::make_task will fail as the module is
already stopped.

Move the assertion, that checks if _tasks is empty, after the
compaction_states' gates are closed.

Fixes: #25806.
(cherry picked from commit 97c77d7cd5)
2025-09-25 16:18:46 +02:00
Aleksandra Martyniuk
9c5ed9586d compaction: stop compaction module in really_do_stop
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.

Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.

Stop compaction module in really_do_stop, after ongoing compactions
are stopped.

It's a preparation for further patches.

(cherry picked from commit 17707d0e6b)
2025-09-25 16:18:45 +02:00
Ferenc Szili
38aa7be6a1 load_balancer: fix std::out_of_bounds when decommissioning with empty nodes
Consider the following:

The tablet load balancer is working on:

- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity then node1
- node3: is being decommissioned and contains tablet replicas

In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:

// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};

load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.

Let's assume dst is a shard on node1.

Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:

// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
                                            drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;

If pick_candidate() selects some other empty destination (due to larger
capacity: node1) node, and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_bounds() exception crashing the load balancer.

This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.

Fixes: #26203

Closes scylladb/scylladb#26207

(cherry picked from commit c6c9c316a7)

Closes scylladb/scylladb#26238
2025-09-25 09:51:23 +03:00
Ferenc Szili
142156a808 docs: add capacity based balancing explanation
Capacity based balancing was introduced in 2025.1. It computes balance
based on a node's capacity: the number of tablets located on a node
should be directly proportional to that node's storage capacity.

This change adds this explanation to the docs.

Fixes: #25686

Closes scylladb/scylladb#25687

(cherry picked from commit de5dab8429)

Closes scylladb/scylladb#26105
2025-09-25 09:50:41 +03:00
Raphael S. Carvalho
ca1974da91 replica: Fix split compaction when tablet boundaries change
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.

To fix this, let's make split compaction always work with
the state when it started, not a global state.

Fixes #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 68f23d54d8)
2025-09-24 20:27:11 -03:00
Raphael S. Carvalho
b4e6795bd5 replica: Futurize split_compaction_options()
Prepararation for the fix of #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0c1587473c)
2025-09-24 19:28:07 -03:00
Dawid Mędrek
ee800b9682 db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057

(cherry picked from commit 35f7d2aec6)

Closes scylladb/scylladb#26198
2025-09-24 09:51:29 +03:00
Dawid Mędrek
d6c59c5224 test/perf/tablet_load_balancing.cc: Create nodes within one DC
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.

Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.

We still accept different values of replication factor when provided
manually by the user (by `--rf1` and `--rf2` commandline options). Scylla
won't allow for creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.

Fixes scylladb/scylladb#26026

Closes scylladb/scylladb#26115

(cherry picked from commit 0d2560c07f)

Closes scylladb/scylladb#26170
2025-09-24 09:50:39 +03:00
Pavel Emelyanov
4431dd158f Merge '[Backport 2025.1] compaction/scrub: register sstables for compaction before validation' from Scylladb[bot]
compaction/scrub: register sstables for compaction before validation

When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

This reported scrub failure occurs on all versions that have the
checksum/digest validation feature for uncompressed sstables.
So, backport it to older versions.

- (cherry picked from commit 84f2e99c05)

- (cherry picked from commit 7cdda510ee)

Parent PR: #26034

Closes scylladb/scylladb#26097

* github.com:scylladb/scylladb:
  compaction/scrub: register sstables for compaction before validation
  compaction/scrub: handle exceptions when moving invalid sstables to quarantine
2025-09-24 09:50:05 +03:00
Nadav Har'El
b7f1497efd alternator: fix bug in combination of AttributeUpdates + ReturnValues
In test/alternator/test_returnvalues.py we had tests for the
ReturnValues feature on UpdateItem requests - but we only tested
UpdateItem requests with the "modern" UpdateExpression, and forgot to
test the combination of ReturnValues with the old AttributeUpdates API.

It turns out this combination is buggy: when both ReturnValues=ALL_OLD
and AttributeUpdates need the previous value of the item, we may wrongly
std::move() the value out, and the operation will fail with a strange
error:

    An error occurred (ValidationException) when calling the UpdateItem
    operation: JSON assert failed on condition 'IsObject()'

The fix in this patch is trivial - just move the std::move() to the
correct place, after both UpdateExpression and AttributeUpdates
handling is done.

This patch also includes a reproducing test, which fails before this
patch and passes with it - and of course passes on DynamoDB. This
test reproduces two cases where the bug happened, as well as one
case where it didn't (to make sure we don't regress in what already
worked).

Fixes #25894

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25900

(cherry picked from commit 3c0032deb4)

Closes scylladb/scylladb#26094
2025-09-24 09:48:49 +03:00
Łukasz Paszkowski
51b13edd2c compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24589
2025-09-23 10:30:34 +03:00
Tomasz Grabiec
ce99996670 tablets: scheduler: Run plan-maker in maintenance scheduling group
Currently, it runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the
same amount of shares as user workload. Plan-making can take
significant amount of time during rebalancing, and we don't want that
to impact user workload which happens to run on the same shard.

Reduce impact by running in the maintenance scheduling group.

Fixes #26037

Closes scylladb/scylladb#26046

(cherry picked from commit ddbcea3e2a)

Closes scylladb/scylladb#26166
2025-09-23 02:04:01 +02:00
Szymon Malewski
21610258df alternator/expressions.g: Fix antlr3 missing token leak
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Solution is copied from CQL implementation.

A unit test to reproduce the issue is added - leak would be reported
by ASAN, when running this test in debug mode - the test passed but
the leak is discovered when the test file exits.

Fixes #25878

Closes scylladb/scylladb#25930

(cherry picked from commit 776f90e2f8)

Closes scylladb/scylladb#26083
2025-09-19 19:13:24 +03:00
Lakshmi Narayanan Sreethar
f26bc7e2e1 compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 7cdda510ee)
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-19 18:30:13 +05:30
Lakshmi Narayanan Sreethar
236d577721 compaction/scrub: handle exceptions when moving invalid sstables to quarantine
In validate mode, scrub moves invalid sstables into the quarantine
folder. If validation fails because the sstable files are missing from
disk, there is nothing to move, and the quarantine step will throw an
exception. Handle such exceptions so scrub can return a proper
compaction_result instead of propagating the exception to the caller.
This will help the testcase for #23363 to reliably determine if the
scrub has failed or not.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 84f2e99c05)
2025-09-18 15:13:35 +05:30
Pavel Emelyanov
5d7515cbb0 s3: Export memory usage gauge (metrics)
The memory usage is tracked with the help of a semaphore, so just export
its "consumed" units.

One tricky place here is the need to skip metrics registration for
scylla-sstable tool. The thing is that the tools starts the storage
manager and sstables manager on start and then some of tool's operations
may want to start both managers again (via cql environment) causing
double metrics registration exception.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25769

backport note: in 2025.1 scylla-sstable tool doesn't start storage
manager (because #22321 is not there), so the whole "skip_metrics_reg."
bit is not needed. Still it's here, always OFF

Closes scylladb/scylladb#25936
2025-09-18 07:45:35 +03:00
Sergey Zolotukhin
e88e65426e gossiper: ensure gossiper operations are executed in gossiper scheduling group
Sometimes gossiper operations invoked from storage_service and other components
run under a non-gossiper scheduling group. If these operations acquire gossiper
locks, priority inversion can occur: higher-priority gossiper tasks may wait
behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness
or even failures.
This patch ensures that gossiper operations requiring locks on gossiper
structures are explicitly executed in the gossiper scheduling group.
To help detect similar issues in the future, a warning is logged whenever
a gossiper lock is acquired under a non-gossiper scheduling group.

Fixes scylladb/scylladb#25907

(cherry picked from commit 6c2a145f6c)

Closes scylladb/scylladb#26068
2025-09-18 07:45:10 +03:00
Sergey Zolotukhin
1a69ac0ed5 raft: disable caching for raft log.
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.

Fixes scylladb/scylladb#26027

Closes scylladb/scylladb#26031

(cherry picked from commit 2640b288c2)

Closes scylladb/scylladb#26069
2025-09-18 07:44:08 +03:00
Wojciech Mitros
c32229b35c storage_proxy: send hints to pending replicas
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.

This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.

To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.

This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write

This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.

We also add a test case reproducing the issue.

Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://github.com/scylladb/scylladb/issues/19835

Closes scylladb/scylladb#25590

(cherry picked from commit 10b8e1c51c)

Closes scylladb/scylladb#25880
2025-09-17 08:06:58 +02:00
Wojciech Mitros
90536040ad mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720

(cherry picked from commit 1f9be235b8)

Closes scylladb/scylladb#25954
2025-09-16 16:00:02 +02:00
Jenkins Promoter
4fbb084580 Update pgo profiles - aarch64 2025-09-15 04:48:34 +03:00
Jenkins Promoter
8041db0a69 Update pgo profiles - x86_64 2025-09-15 04:06:01 +03:00
Jenkins Promoter
58db3e7efe Update ScyllaDB version to: 2025.1.8 2025-09-14 16:35:47 +03:00
Patryk Jędrzejczak
e4c2930d7c Merge '[Backport 2025.1] test: cluster: deflake consistency checks after decommission' from Scylladb[bot]
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this PR.

Fixes #25809

This PR improves CI stability and changes only tests, so it should be
backported to all supported branches.

- (cherry picked from commit e41fc841cd)

- (cherry picked from commit bb9fb7848a)

Parent PR: #25927

Closes scylladb/scylladb#25961

* https://github.com/scylladb/scylladb:
  test: cluster: deflake consistency checks after decommission
  test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
2025-09-12 10:33:12 +02:00
Patryk Jędrzejczak
e10cd81dfb test: cluster: deflake consistency checks after decommission
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this
commit.

Fixes #25809

(cherry picked from commit bb9fb7848a)
2025-09-11 15:25:36 +02:00
Patryk Jędrzejczak
b32d90b3f5 test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.

We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.

We also update the obsolete docstring in this commit.

(cherry picked from commit e41fc841cd)
2025-09-11 15:25:32 +02:00
Yaron Kaikov
84749ce199 build_docker.sh: enable debug symboles installation
Adding the latest scylla.repo location to our docker container, this
will allow installation scylla-debuginfo package in case it's needed

Fixes: https://github.com/scylladb/scylladb/issues/24271

Closes scylladb/scylladb#25646

(cherry picked from commit d57741edc2)

Closes scylladb/scylladb#25889
2025-09-10 11:04:40 +03:00
Dawid Mędrek
cf376778e1 test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity
We modify the logic to make sure that all of the keyspaces that the test
creates are RF-rack-valid. For that, we distribute the nodes across two
DCs and as many racks as the provided replication factor.

That may have an effect on the load balancing logic, but since this is
a performance test and since tablet load balancing is still taking place,
it should be acceptable.

This commit also finishes work in adjusting perf tests to pass with
the `rf_rack_valid_keyspaces` configuration option enabled. The remaining
tests either don't attempt to create keyspaces or they already create
RF-rack-valid keyspaces.

We don't need to explicitly enable the configuration option. It's already
enabled by default by `cql_test_config`. The reason why we haven't run into
any issue because of that is that performance tests are not part of our CI.

Fixes scylladb/scylladb#25127

Closes scylladb/scylladb#25728

(cherry picked from commit 789a4a1ce7)

Closes scylladb/scylladb#25920
2025-09-10 11:02:09 +03:00
Asias He
74f739352d streaming: Enclose potential throws in try block and ensure sink close before logging
- Move the initialization of log_done inside the try block to catch any
  exceptions it may throw.

- Relocate the failure warning log after sink.close() cleanup
  to guarantee sink.close() is always called before logging errors.

Refs #25497

Closes scylladb/scylladb#25591

(cherry picked from commit b12404ba52)

Closes scylladb/scylladb#25901
2025-09-10 11:02:09 +03:00
Asias He
31ce11ca58 streaming: Fix use after move in the tablet_stream_files_handler
The files object is moved before the log when stream finishes. We've
logged the files when the stream starts. Skip it in the end of
streaming.

Fixes #25830

Closes scylladb/scylladb#25835

(cherry picked from commit 451e1ec659)

Closes scylladb/scylladb#25888
2025-09-10 11:02:09 +03:00
Patryk Jędrzejczak
32471fa8db Merge '[Backport 2025.1] gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Scylladb[bot]
Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.

Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.

This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.

Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621

Ref https://github.com/scylladb/scylla-enterprise/issues/5613

Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.

- (cherry picked from commit 28e0f42a83)

- (cherry picked from commit f08df7c9d7)

- (cherry picked from commit 775642ea23)

- (cherry picked from commit b34d543f30)

Parent PR: #25849

Closes scylladb/scylladb#25896

* https://github.com/scylladb/scylladb:
  gossiper: fix empty initial local node state
  gossiper: add test for a race condition in start_gossiping
  gossiper: check for a race condition in `do_apply_state_locally`
  test/gossiper: add reproducible test for race condition during node decommission
2025-09-09 12:47:22 +02:00
Sergey Zolotukhin
0d013d6c9e gossiper: fix empty initial local node state
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.

Fixes: scylladb/scylladb#25831
(cherry picked from commit b34d543f30)
2025-09-09 01:04:51 +02:00
Sergey Zolotukhin
398d1d0560 gossiper: add test for a race condition in start_gossiping
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.

Test for scylladb/scylladb#25831

(cherry picked from commit 775642ea23)
2025-09-09 01:04:51 +02:00
Sergey Zolotukhin
c0e757738a gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change
1. adds a check after locking the map entry: if a gossip ACK update
   does not contain a host ID, we verify that an entry with that host ID
   still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
   fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
   nodes that sent an empty host ID.

This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.

Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621

Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
2025-09-09 01:03:43 +02:00
Emil Maskovsky
695f345821 test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
(cherry picked from commit 28e0f42a83)
2025-09-09 01:03:43 +02:00
Avi Kivity
1fd82d32e0 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 15:47:34 +03:00
Botond Dénes
927bedaf9b Merge '[Backport 2025.1] repair: distribute tablet_repair_task_metas between shards' from Aleksandra Martyniuk
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps repair_tablet_metas of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Modify tablet_repair_task_impl to repair only the tablets which replicas
are kept on this shard. Modify repair_service::repair_tablets to gather
repair_tablet_metas only on local shard. repair_tablets is invoked on
all shards.

Add a new legacy_tablet_repair_task_impl that covers tablet repair started with
async_repair. A user can use sequence number of this task to manage the repair
using storage_service API.

In a test that reproduced this, we have seen 11136 tablets and 5636096 bytes
allocation failure. If we had a node with 250 shards, 100 tablets each, we could
reach 12MB kept on one shard for the whole repair time.

Fixes: https://github.com/scylladb/scylladb/issues/23632

Needs backport to all live branches as they are all vulnerable to such crashes.

Closes scylladb/scylladb#25353

* github.com:scylladb/scylladb:
  repair: distribute tablet_repair_task_meta among shards
  repair: do not keep erm in tablet_repair_task_meta
2025-09-05 19:05:13 +03:00
Pavel Emelyanov
235fe9f83e Revert "test/gossiper: add reproducible test for race condition during node decommission"
This reverts commit 4b6097ab5b because
parent PR had been reverted as per #25803
2025-09-05 10:05:55 +03:00
Aleksandra Martyniuk
dfc18045f0 repair: distribute tablet_repair_task_meta among shards
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps tablet_repair_task_meta of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Add remote_metas class which takes care of distributing tablet_repair_task_meta
among different shards. An additional class remote_metas_builder was
added in order to ensure safety and separate writes and reads to meta
vectors.

Fixes: #23632
(cherry picked from commit 132e6495a3)
2025-09-04 11:28:05 +02:00
Calle Wilund
9631beeafd system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699

(cherry picked from commit bc20861afb)

Closes scylladb/scylladb#25811
2025-09-04 08:41:30 +03:00
Pavel Emelyanov
1f6c11d7f3 Merge '[Backport 2025.1] drop table: fix crash on drop table with concurrent cleanup' from Scylladb[bot]
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

- (cherry picked from commit a0934cf80d)

- (cherry picked from commit 1b8a44af75)

Parent PR: #25708

Closes scylladb/scylladb#25783

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-04 08:41:04 +03:00
Pavel Emelyanov
bf0d1cf272 Merge '[Backport 2025.1] encryption_at_rest_test: Improve exception handling and increase verbosity in fake proxy' from Nikos Dragazis
PRs included:
* #22810 (encryption_at_rest_test/encryption: Add some verbosity etc to help diagnose test run issues)
* #23778 (encryption_at_rest_test: Make fake_proxy read/write loop noexcept)

Backporting them together because the latter needs the former to avoid conflicts, but the former cannot be backported individually due to not fixing an issue.

* (cherry picked from commit 83aa66da1a)
* (cherry picked from commit 5905c19ab4)
* (cherry picked from commit 00263aa57a)
* (cherry picked from commit 4a44651fce)

Refs #22628.
Fixes #23774.

Closes scylladb/scylladb#25774

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Make fake_proxy read/write loop noexcept
  gcp/aws kms: Promote service_error to recoverable + use malformed_response_error
  encryption_at_rest_test: Add verbosity + earlier stream close to proxy
  encryption: Add exception handler to context init (for tests)
2025-09-03 20:51:45 +03:00
Pavel Emelyanov
dba40c7a38 Merge '[Backport 2025.1] service/qos: Modularize service level controller to avoid invalid access to auth::service' from Scylladb[bot]
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.

Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.

Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.

The logic responsible for preventing to access an uninitialized `auth::service`
was also either non-existent, complex, or non-sufficient.

To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.

Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.

If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.

There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.

The relevant portion of the initialization and deinitialization flow:

(a) Before the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.

(b) After the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
   it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.

(*):
    The reference to `auth::service` in `service_level_controller` is still
    necessary. We need to access the service when dropping a distributed
    service level.

    Although it would be best to cut that link between the service level
    controller and `auth::service` too, effectively separating the entities,
    it would require more work, so we leave it as-is for now.

    It shouldn't prove problematic as far as accessing an uninitialized service
    goes. Trying to drop a service level at the point when we're de-initializing
    auth should be impossible.

    For more context, see the function `drop_distributed_service_level` in
    `service_level_controller`.

A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.

Fixes scylladb/scylladb#24792

Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.

- (cherry picked from commit 7d0086b093)

- (cherry picked from commit 34afb6cdd9)

- (cherry picked from commit e929279d74)

- (cherry picked from commit dd5a35dc67)

- (cherry picked from commit fc1c41536c)

Parent PR: #25478

Closes scylladb/scylladb#25750

* github.com:scylladb/scylladb:
  service/qos: Move effective SL cache to auth_integration
  service/qos: Add auth::service to auth_integration
  service/qos: Reload effective SL cache conditionally
  service/qos: Add gate to auth_integration
  service/qos: Introduce auth_integration
2025-09-03 20:21:12 +03:00
Ferenc Szili
d8dd7e8eba test: reproducer and test for drop with concurrent cleanup
This change adds a reproducer and test for issue #25706

(cherry picked from commit 1b8a44af75)
2025-09-03 16:12:49 +02:00
Emil Maskovsky
4b6097ab5b test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685

(cherry picked from commit 5dac4b38fb)

Closes scylladb/scylladb#25779
2025-09-03 13:35:37 +02:00
Raphael S. Carvalho
5dd35e2de5 replica: Fix take_storage_snapshot() running concurrently to merge completion
Some background:
When merge happens, a background fiber wakes up to merge compaction
groups of sibling tablets into main one. It cannot happen when
rebuilding the storage group list, since token metadata update is
not preemptable. So a storage group, post merge, has the main
compaction group and two other groups to be merged into the main.
When the merge happens, those two groups are empty and will be
freed.

Consider this scenario:
1) merge happens, from 2 to 1 tablet
2) produces a single storage group, containing main and two
other compaction groups to be merged into main.
3) take_storage_snapshot(), triggered by migration post merge,
gets a list of pointer to all compaction groups.
4) t__s__s() iterates first on main group, yields.
5) background fiber wakes up, moves the data into main
and frees the two groups
6) t__s__s() advances to other groups that are now freed,
since step 5.
7) segmentation fault

In addition to memory corruption, there's also a potential for
data to escape the iteration in take_storage_snapshot(), since
data can be moved across compaction groups in background, all
belonging to the same storage group. That could result in
data loss.

Readers should all operate on storage group level since it can
provide a view on all the data owned by a tablet replica.
The movement of sstable from group A to B is atomic, but
iteration first on A, then later on B, might miss data that
was moved from B to A, before the iteration reached B.
By switching to storage group in the interface that retrieves
groups by token range, we guarantee that all data of a given
replica can be found regardless of which compaction group they
sit on.

Fixes #23162.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24058

(cherry picked from commit 28056344ba)

Closes scylladb/scylladb#25337
2025-09-03 13:59:32 +03:00
Calle Wilund
39242c3d5a commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719

(cherry picked from commit cc9eb321a1)

Closes scylladb/scylladb#25754
2025-09-03 06:55:59 +03:00
Dawid Mędrek
79caefc24b service/qos: Move effective SL cache to auth_integration
Since `auth_integration` manages effective service levels, let's move
the relevant cache from `service_level_controller` to it.

(cherry picked from commit fc1c41536c)
2025-09-02 19:43:56 +02:00
Dawid Mędrek
7b085f9198 service/qos: Add auth::service to auth_integration
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.

Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.

In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.

That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.

(cherry picked from commit dd5a35dc67)
2025-09-02 19:43:56 +02:00
Dawid Mędrek
c533c4bf96 service/qos: Reload effective SL cache conditionally
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.

These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.

We provide a reproducer test that consistently fails before these
changes but passes after them.

Fixes scylladb/scylladb#24792

(cherry picked from commit e929279d74)
2025-09-02 19:43:51 +02:00
Dawid Mędrek
7181596d5f service/qos: Add gate to auth_integration
We add a gate to `auth_integration` that will aid us in synchronizing
ongoing tasks with stopping the service.

(cherry picked from commit 34afb6cdd9)
2025-09-02 19:34:35 +02:00
Dawid Mędrek
c904f6a0fc service/qos: Introduce auth_integration
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.

The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.

Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.

We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.

(cherry picked from commit 7d0086b093)
2025-09-02 19:25:23 +02:00
Botond Dénes
1b233a25fd Merge '[Backport 2025.1] system_keyspace: add peers cache to get_ip_from_peers_table' from Scylladb[bot]
The gossiper can call `storage_service::on_change` frequently (see  scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

- (cherry picked from commit 91c633371e)

- (cherry picked from commit de5dc4c362)

- (cherry picked from commit 4b907c7711)

Parent PR: #25658

Closes scylladb/scylladb#25762

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-09-02 11:24:06 +03:00
Ferenc Szili
6b9f77fdcd truncate: check for closed storage group's gate in discard_sstables
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the
  tablet with its gate closed. This fails, because
  table::parallel_foreach_compaction_group() ultimately calls
  storage_group_manager::parallel_foreach_storage_group() which will not
  disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction
  has been disabled, and because it hasn't, it then runs
  on_internal_error() with "compaction not disabled on table ks.cf during
  TRUNCATE" which causes a crash

This patch makes dicard_sstables check if the storage group's gate is
closed whend checking for disabled compaction.

(cherry picked from commit a0934cf80d)
2025-09-02 02:17:47 +00:00
Nadav Har'El
fc738ee9ce Merge '[backport 2025.1] token_range_vector: fragment' from Avi Kivity
token_range_vector is a sequence of intervals of tokens. It is used
to describe vnodes or token ranges owned by shards.

Since tokens are bloated (16 bytes instead of 8), and intervals are bloated
(40 byte of overhead instead of 8), and since we have plenty of token ranges,
such vectors can exceed our allocation unit of 128 kB and cause allocation stalls.

This series fixes that by first generalizing some helpers and then changing
token_range_vector to use chunked_vector.

Although this touches IDL, there is no compatibility problem since the encoding
for vector and chunked_vector are identical.

There is no performance concern since token_range_vector is never used on
any hot path (hot paths always contain a partition key).

Fixes #3335.
Fixes #24115.

Fixes #24156

Backport notes:

Due to compiler limitations in this toolchain, the template template parameters were replaced
by elaborate template metaprogramming, see patch 'partition_range_compat: generalize wrap/unwrap helpers'.

Closes scylladb/scylladb#25704

* github.com:scylladb/scylladb:
  dht: fragment token_range_vector
  partition_range_compat: generalize wrap/unwrap helpers
  utils: chunked_vector: add swap() method
  utils: chunked_vector: add range insert() overloads
2025-09-01 19:07:54 +03:00
Calle Wilund
57f3f1465e encryption_at_rest_test: Make fake_proxy read/write loop noexcept
Fixes #23774

Test code falls into same when_all issue as http client did.
Avoid passing exceptions through this, and instead catch and
report in worker lambda.

Closes scylladb/scylladb#23778

(cherry picked from commit 4a44651fce)
2025-09-01 16:08:33 +03:00
Calle Wilund
bf6f51dc11 gcp/aws kms: Promote service_error to recoverable + use malformed_response_error
Refs #22628

Mark problems parsing response (partial message, network error without exception etc
- hello testing), as "malformed_response_error", and promote this as well as
general "service_error" to recoverable exceptions (don't isolate node on error).

This to better handle intermittent network issues as well as making error-testing
more deterministic.

(cherry picked from commit 00263aa57a)
2025-09-01 16:07:45 +03:00
Calle Wilund
e52675415f encryption_at_rest_test: Add verbosity + earlier stream close to proxy
Refs #22628

Adds some verbosity to track issues with the network proxy used to test
EAR connector difficulties. Also adds an earlier close in input stream
to help network usage.

Note: This is a diagnostic helper. Still cannot repro the issue above.
(cherry picked from commit 5905c19ab4)
2025-09-01 16:06:57 +03:00
Calle Wilund
19dffe1b19 encryption: Add exception handler to context init (for tests)
Adds exception handler + cleanup for the case where we have a
bad config/env vars (hint minio) or similar, such that we fail
with exception during setting up the EAR context.
In a normal startup, this is ok. We will report the exception,
and the do a exit(1).

In tests however, we don't and active context will instead be
freed quite proper, in which case we need to call stop to ensure
we don't crash on shared pointer destruction on wrong shard.
Doing so will hide the real issue from whomever runs the test.

(cherry picked from commit 83aa66da1a)
2025-09-01 16:06:40 +03:00
Petr Gusev
751a06a252 storage_service: move get_host_id_to_ip_map to system_keyspace
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.

(cherry picked from commit 4b907c7711)
2025-09-01 11:12:29 +02:00
Petr Gusev
8f058aa575 system_keyspace: use peers cache in get_ip_from_peers_table
The storage_service::on_change method can be called quite often
by the gossiper, see scylladb/scylla-enterprise#5613. In this commit
we introduce a temporal cache for system.peers so that we don't have
to go to the storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario when the user wants to update system.peers table
from CQL.

Fixes scylladb/scylladb#25660

(cherry picked from commit de5dc4c362)
2025-09-01 11:08:18 +02:00
Petr Gusev
d96a153966 storage_service: move get_ip_from_peers_table to system_keyspace
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.

(cherry picked from commit 91c633371e)
2025-09-01 11:08:00 +02:00
Pavel Emelyanov
e22dbb05a8 Merge '[Backport 2025.1] GCP Key Provider: Fix authentication issues' from Scylladb[bot]
* Fix discovery of application default credentials by using fully expanded pathnames (no tildes).
* Fix grant type in token request with user credentials.

Fixes #25345.

- (cherry picked from commit 77cc6a7bad)

- (cherry picked from commit b1d5a67018)

Parent PR: #25351

Closes scylladb/scylladb#25405

* github.com:scylladb/scylladb:
  encryption: gcp: Fix the grant type for user credentials
  encryption: gcp: Expand tilde in pathnames for credentials file
2025-09-01 09:03:41 +03:00
Taras Veretilnyk
bd4a80301c keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829

Closes scylladb/scylladb#25554
2025-09-01 09:03:12 +03:00
Pavel Emelyanov
625306fec3 Merge '[Backport 2025.1] cql3: Warn when creating RF-rack-invalid keyspace' from Scylladb[bot]
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.

We provide validation tests.

Fixes scylladb/scylladb#23330

Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.

- (cherry picked from commit 60ea22d887)

- (cherry picked from commit af8a3dd17b)

- (cherry picked from commit 837d267cbf)

Parent PR: #24785

Closes scylladb/scylladb#25633

* github.com:scylladb/scylladb:
  main: Log RF-rack-invalid keyspaces at startup
  cql3/statements: Fix indentation
  cql3: Warn when creating RF-rack-invalid keyspace
2025-09-01 09:02:34 +03:00
Aleksandra Martyniuk
84717267ff replica: lower severity of failure log
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.

Lower the severity of a log in dirty_memory_manager::flush_one
for this exception.

Fixes: https://github.com/scylladb/scylladb/issues/25037.

Closes scylladb/scylladb#25355

(cherry picked from commit a10e241228)

Closes scylladb/scylladb#25648
2025-09-01 09:02:16 +03:00
Emil Maskovsky
14d90bc62a storage: pass host_id as parameter to maybe_reconnect_to_preferred_ip()
Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID
using `gossiper::get_host_id()`. Since the host ID is already available
in the calling function, we now pass it directly as a parameter.

This change simplifies the code and eliminates a potential race condition
where `gossiper::get_host_id()` could fail, as described in scylladb/scylladb#25621.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25715

Backport: Recommended for 2025.x release branches to avoid potential issues
from unnecessary calls to `gossiper::get_host_id()` in subscribers.

(cherry picked from commit cfc87746b6)

Closes scylladb/scylladb#25716
2025-09-01 09:01:45 +03:00
Calle Wilund
c9286a78c5 system_keyspace: Limit parallelism in drop_truncation_records
Fixes #25682
Refs scylla-enterprise#5580

If the truncation table is large in entries, we might create a
huge parallel execution, quite possibly consuming loads of resources
doing something quite trivial.
Limit concurrency to a small-ish number

Closes scylladb/scylladb#25678

(cherry picked from commit 2eccd17e70)

Closes scylladb/scylladb#25747
2025-09-01 09:01:21 +03:00
Jenkins Promoter
4bc70cfccb Update pgo profiles - aarch64 2025-09-01 05:00:19 +03:00
Jenkins Promoter
feb043c757 Update pgo profiles - x86_64 2025-09-01 04:34:07 +03:00
Avi Kivity
f2d0444dd1 dht: fragment token_range_vector
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.

Convert it to utils::chunked_vector, which prevents allocation
stalls.

It is not used in any hot path, as it usually describes
vnodes or similar things.

Fixes #3335.

(cherry picked from commit 844a49ed6e)
2025-08-27 18:28:58 +03:00
Avi Kivity
1c2a4540eb partition_range_compat: generalize wrap/unwrap helpers
These helpers convert vectors of wrapped intervals to
vectors of unwrapped intervals and vice versa.

Generalize them to work on any sequence type. This is in
preparation of moving from vectors to chunked_vectors.

(cherry picked from commit 83c2a2e169)

Backport notes: due to compiler limitations in this toolchain,
the template template parameter was replaced by elaborate
metaprogramming to detect the value type and rebind the container
to a new value type.
2025-08-27 18:19:33 +03:00
Avi Kivity
951841f71f utils: chunked_vector: add swap() method
Following std::vector(), we implement swap(). It's a simple matter
of swapping all the contents.

A unit test is added.

(cherry picked from commit 13a75ff835)
2025-08-27 14:24:35 +03:00
Avi Kivity
81891a6c31 utils: chunked_vector: add range insert() overloads
Inserts an iterator range at some position.

Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.

Unit tests are added.

(cherry picked from commit 24e0d17def)
2025-08-27 14:24:35 +03:00
Avi Kivity
24937ee61c Update seastar submodule (more information on segfault)
* seastar b2ff08a502...c4ca171b62 (1):
  > reactor: more info, robustness on segfault

Fixes #25681
2025-08-27 14:20:33 +03:00
Nadav Har'El
d2a4f009da alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535

(cherry picked from commit 2385fba4b6)

Closes scylladb/scylladb#25657
2025-08-27 10:26:00 +03:00
Pavel Emelyanov
3565e19291 Merge '[Backport 2025.1] raft_sys_table_storage: avoid temp buffer when deserializing log_entry' from Scylladb[bot]
The get_blob method linearizes data by copying it into a single buffer, which can cause 'oversized allocation' warnings.

In this commit we avoid copying by creating input stream on top of the original fragmened managed bytes, returned by untyped_result_set_row::get_view.

Fixes scylladb/scylladb#23903

backport: no need, not a critical issue.

- (cherry picked from commit 6496ae6573)

- (cherry picked from commit f245b05022)

Parent PR: #24123

Closes scylladb/scylladb#25595

* github.com:scylladb/scylladb:
  raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
  serializer_impl.hh:  add as_input_stream(managed_bytes_view) overload
2025-08-27 10:23:56 +03:00
Petr Gusev
8a761a10fe raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
The get_blob() method linearizes data by copying it into a
single buffer, which can trigger "oversized allocation" warnings.
This commit avoids that extra copy by creating an input stream
directly over the original fragmented managed bytes returned by
untyped_result_set_row::get_view().

Fixes scylladb/scylladb#23903

(cherry picked from commit f245b05022)
2025-08-26 15:57:07 +02:00
Petr Gusev
f4e04d640b serializer_impl.hh: add as_input_stream(managed_bytes_view) overload
It's useful to have it here so that people can find it easily.

(cherry picked from commit 6496ae6573)
2025-08-26 15:57:07 +02:00
Dawid Mędrek
8ba65e4014 main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.

(cherry picked from commit 837d267cbf)
2025-08-25 18:45:06 +02:00
Dawid Mędrek
dd0c756551 cql3/statements: Fix indentation
(cherry picked from commit af8a3dd17b)
2025-08-25 18:41:29 +02:00
Dawid Mędrek
31e10b97a2 cql3: Warn when creating RF-rack-invalid keyspace
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

We provide a validation test.

(cherry picked from commit 60ea22d887)
2025-08-25 18:41:26 +02:00
kendrick-ren
3e7a99b283 Update launch-on-gcp.rst
Add the missing '=' mark in --zone option. Otherwise the command complains.

Closes scylladb/scylladb#25471

(cherry picked from commit d6e62aeb6a)

Closes scylladb/scylladb#25643
2025-08-25 10:28:23 +03:00
Benny Halevy
bec1e9536f api: storage_service: fix token_range documentation
Note that the token_range type is used only by describe_ring.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25609

(cherry picked from commit 45c496c276)

Closes scylladb/scylladb#25638
2025-08-25 10:27:53 +03:00
Michał Jadwiszczak
d71ad682c9 test/cqlpy/test_service_level_api: update to service levels on raft and remove flakiness
Tests in `test_service_level_api` were written before
scylladb/scylladb#16585 and they were doing 10s sleeps to wait for
service level controller to update its configuration. Now performing
a read barrier is sufficient to ensure SL configuration is up-to-date,
which significantly reduces tests time (from ~60s to ~2-3s).

Moreover, there was flakiness in the `test_switch_tenants` test.
Until now, the test waited up to 60s for the connections to update
their scheduling groups. However, it is difficult to determine
how long the process might take because a connection may be blocked
while waiting for the next request to be processed,
and the scheduling group will be updated only after a request is processed
(see `generic_server::connection::process_until_tenant_switch()`).
To address this issue, 100 simple queries are executed so that
connections on all shards process at least one request
and update their scheduling groups.

Fixes scylladb/scylladb#22768

Closes scylladb/scylladb#23381

(cherry-picked from commit 0ee0696959)

Closes scylladb/scylladb#25597
2025-08-25 10:27:23 +03:00
Pavel Emelyanov
3710eadb93 Merge '[Backport 2025.1] db/hints: Improve logs' from Scylladb[bot]
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

Fixes scylladb/scylladb#25466

Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.

- (cherry picked from commit 2327d4dfa3)

- (cherry picked from commit d7bc9edc6c)

- (cherry picked from commit 6f1fb7cfb5)

Parent PR: #25470

Closes scylladb/scylladb#25536

* github.com:scylladb/scylladb:
  db/hints: Add new logs
  db/hints: Adjust log levels
  db/hints: Improve logs
2025-08-25 10:26:42 +03:00
Pavel Emelyanov
5e86507adc Merge '[Backport 2025.1] generic server: 2 step shutdown' from Scylladb[bot]
This PR implements solution proposed in scylladb/scylladb#24481

Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.

The updated shutdown process is as follows:

1. Initial Shutdown Phase
   * Close the accept gate to block new incoming connections.
   * Abort all accept() calls.
   * For all active connections:
      * Close only the input side of the connection to prevent new requests.
      * Keep the output side open to allow responses to be sent.

2. Drain Phase
   * Wait for all in-progress requests to either complete or fail.

3. Final Shutdown Phase
   * Fully close all connections.

Fixes scylladb/scylladb#24481

- (cherry picked from commit 122e940872)

- (cherry picked from commit 3848d10a8d)

- (cherry picked from commit 3610cf0bfd)

- (cherry picked from commit 27b3d5b415)

- (cherry picked from commit 061089389c)

- (cherry picked from commit 7334bf36a4)

- (cherry picked from commit ea311be12b)

- (cherry picked from commit 4f63e1df58)

Parent PR: #24499

Closes scylladb/scylladb#25517

* github.com:scylladb/scylladb:
  test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`,  decrease request timeout.
  generic_server: Two-step connection shutdown.
  generic_server: replace empty destructor with `= default`
  generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
  generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
  test: Add test for query execution during CQL server shutdown
2025-08-22 10:19:30 +03:00
Michał Chojnowski
4e0e8f5632 sstables/types.hh: fix fmt::formatter<sstables::deletion_time>
Obvious typo.

Fixes scylladb/scylladb#25556

Closes scylladb/scylladb#25557

(cherry picked from commit c1b513048c)

Closes scylladb/scylladb#25586
2025-08-21 13:00:22 +03:00
Wojciech Mitros
6c35b50221 test: run mv tests depending on metrics on a standalone instance
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which  runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exists, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.

The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
the corresponding view udpates generated during view building are redundant.

Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.

Fixes https://github.com/scylladb/scylladb/issues/20379

Closes scylladb/scylladb#25209

(cherry picked from commit 2ece08ba43)

Closes scylladb/scylladb#25502
2025-08-21 10:29:52 +02:00
Patryk Jędrzejczak
8f2fe85990 test: test_maintenance_socket: use cluster_con for driver sessions
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.

In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.

Fixes scylladb/scylla-enterprise#5601

This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.

Closes scylladb/scylladb#25510

(cherry picked from commit 03cc34e3a0)

Closes scylladb/scylladb#25545
2025-08-20 10:42:14 +02:00
Sergey Zolotukhin
fa4c4a0e5c test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.

(cherry picked from commit 4f63e1df58)
2025-08-20 10:30:51 +02:00
Sergey Zolotukhin
0cb64577fd generic_server: Two-step connection shutdown.
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.

Fixes scylladb/scylladb#24481

(cherry picked from commit ea311be12b)
2025-08-20 10:30:09 +02:00
Sergey Zolotukhin
27105dd533 generic_server: replace empty destructor with = default
This change improves code readability by explicitly marking the destructor as defaulted.

(cherry picked from commit 27b3d5b415)
2025-08-20 10:29:20 +02:00
Sergey Zolotukhin
e4806d9d6d generic_server: refactor connection::shutdown to use shutdown_input and shutdown_output
This change improves logging and modifies the behavior to attempt closing
the output side of a connection even if an error occurs while closing the input side.

(cherry picked from commit 3610cf0bfd)
2025-08-20 10:28:36 +02:00
Sergey Zolotukhin
2a531ec52a generic_server: add shutdown_input and shutdown_output functions to
`connection` class.

The functions are just wrappers for  _fd.shutdown_input() and  _fd.shutdown_output(), with added error reporting.
Needed by later changes.

(cherry picked from commit 3848d10a8d)
2025-08-20 10:28:07 +02:00
Sergey Zolotukhin
5da3e4346c test: Add test for query execution during CQL server shutdown
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.

Test for scylladb/scylladb#24481

(cherry picked from commit 122e940872)
2025-08-20 10:27:25 +02:00
Pavel Emelyanov
8e6e54e899 Merge '[Backport 2025.1] test: test_mv_backlog: fix to consider internal writes' from Scylladb[bot]
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.

The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.

The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.

Fixes https://github.com/scylladb/scylladb/issues/23139

backport to improve CI stability - test only change

- (cherry picked from commit 5c28cffdb4)

- (cherry picked from commit 276a09ac6e)

Parent PR: #25279

Closes scylladb/scylladb#25473

* github.com:scylladb/scylladb:
  test: test_mv_backlog: fix to consider internal writes
  test/pylib/rest_client: fix ScyllaMetrics filtering
2025-08-19 17:02:31 +03:00
Dawid Mędrek
dc15d64c50 db/hints: Add new logs
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.

(cherry picked from commit 6f1fb7cfb5)
2025-08-18 15:59:42 +02:00
Dawid Mędrek
b1ecfe6ce4 db/hints: Adjust log levels
Some of the logs could be clogging Scylla's logs, so we demote their
level to a lower one.

On the other hand, some of the logs would most likely not do that,
and they could be useful when debugging -- we promote them to debug
level.

(cherry picked from commit d7bc9edc6c)
2025-08-18 15:59:42 +02:00
Dawid Mędrek
ebd4355255 db/hints: Improve logs
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

(cherry picked from commit 2327d4dfa3)
2025-08-18 15:59:39 +02:00
Abhinav Jha
08cd442ddc raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE
RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops. The framework makes hard coded assumptions about
leader which doesn't hold well in case of packet losses.

This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23816

Closes scylladb/scylladb#25489

(cherry picked from commit a0ee5e4b85)

Closes scylladb/scylladb#25526
2025-08-18 12:26:17 +02:00
Jenkins Promoter
4724237537 Update ScyllaDB version to: 2025.1.7 2025-08-17 15:45:37 +03:00
Jenkins Promoter
f22ea03851 Update pgo profiles - aarch64 2025-08-15 04:32:33 +03:00
Jenkins Promoter
46ae7fcae1 Update pgo profiles - x86_64 2025-08-15 04:06:06 +03:00
Yaron Kaikov
88373e930a Revert "dist/docker/debian/build_docker.sh: add scylla-server-dbg"
This reverts commit d7a02eceea.

This makes our containers MUCH larger than they need to be: 800.46 MB (2025.1.5) vs. 273.36 M (2025.1.3).

Fixes: https://github.com/scylladb/scylladb/issues/25479

Closes scylladb/scylladb#25102
2025-08-14 14:54:04 +03:00
Wojciech Przytuła
b9ae9473ba Fix link to ScyllaDB manual
The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs.

Closes scylladb/scylladb#25328

(cherry picked from commit 7600ccfb20)

Closes scylladb/scylladb#25483
2025-08-13 11:17:03 +03:00
Dawid Mędrek
7f205fe063 test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617

(cherry picked from commit b41151ff1a)

Closes scylladb/scylladb#25229
2025-08-13 09:24:09 +03:00
Asias He
dec3c84799 repair: Skip hints and batchlog flush in case of nodes down
The flush api could not detect if the node is down and fail the flush
before the timeout. This patch detects if there is down node and skip
the flush if so, since the flush will fail after the timeout in this
case anyway.

The slowness due to the flush timeout in
compaction_test.py::TestCompaction::test_delete_tombstone_gc_node_down
is fixed with this patch.

Fixes #22413

Closes scylladb/scylladb#22445

(cherry picked from commit 0682b1c716)

Closes scylladb/scylladb#25433
2025-08-13 09:23:43 +03:00
Dawid Mędrek
69307eaf2d db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190

(cherry picked from commit 408b45fa7e)

Closes scylladb/scylladb#25459
2025-08-13 09:22:50 +03:00
Michael Litvak
475db8f48b test: test_mv_backlog: fix to consider internal writes
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.

However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.

To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.

Fixes scylladb/scylladb#23139

(cherry picked from commit 276a09ac6e)
2025-08-12 20:55:48 +03:00
Michael Litvak
abe7ea389e test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for the line to match the filter, either the
filter needs to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected, and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.

(cherry picked from commit 5c28cffdb4)
2025-08-12 20:55:48 +03:00
Szymon Malewski
38f4c3325d test/alternator: enable more relevant logs in CI.
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.

Closes #24645

Closes scylladb/scylladb#25327

(cherry picked from commit eb11485969)

Closes scylladb/scylladb#25381
2025-08-11 07:04:52 +03:00
Botond Dénes
b7e606ac91 Merge '[Backport 2025.1] truncate: change check for write during truncate into a log warning' from Scylladb[bot]
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.

The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail.

This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted

Fixes: #25173
Fixes: #25013

Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1

- (cherry picked from commit 268ec72dc9)

- (cherry picked from commit 33488ba943)

Parent PR: #25174

Closes scylladb/scylladb#25348

* github.com:scylladb/scylladb:
  truncate: add test for truncate with concurrent writes
  truncate: change check for write during truncate into a log warning
2025-08-11 07:02:55 +03:00
Taras Veretilnyk
1ae3cd310b docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175

(cherry picked from commit 6b6622e07a)

Closes scylladb/scylladb#25215
2025-08-11 07:01:57 +03:00
Botond Dénes
7ab6911b03 Merge '[Backport 2025.1] storage_service: cancel all write requests after stopping transports' from Scylladb[bot]
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

- (cherry picked from commit bc934827bc)

- (cherry picked from commit e0dc73f52a)

Parent PR: #24714

Closes scylladb/scylladb#25168

* github.com:scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-08-11 07:01:09 +03:00
Sergey Zolotukhin
4eab3b6a91 storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665

(cherry picked from commit e0dc73f52a)
2025-08-08 15:53:37 +02:00
Sergey Zolotukhin
f309d035f4 test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665

(cherry picked from commit bc934827bc)
2025-08-08 15:53:17 +02:00
Nikos Dragazis
53c261611a encryption: gcp: Fix the grant type for user credentials
Exchanging a refresh token for an access token requires the
"refresh_token" grant type [1].

[1] https://datatracker.ietf.org/doc/html/rfc6749#section-6

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit b1d5a67018)
2025-08-07 21:45:12 +00:00
Nikos Dragazis
4b4def23ce encryption: gcp: Expand tilde in pathnames for credentials file
The GCP host searches for application default credentials in known
locations within the user's home directory using
`seastar::file_exists()`. However, this function does not perform tilde
expansion in pathnames.

Replace tildes with the home directory from the HOME environment
variable.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 77cc6a7bad)
2025-08-07 21:45:12 +00:00
Taras Veretilnyk
599c2351d0 docs: Sort commands list in nodetool.rst
Fixes scylladb/scylladb#25330

Closes scylladb/scylladb#25331

(cherry picked from commit bcb90c42e4)

Closes scylladb/scylladb#25370
2025-08-07 13:14:55 +03:00
Ferenc Szili
a1e80365a7 truncate: add test for truncate with concurrent writes
test_validate_truncate_with_concurrent_writes checks if truncate deletes
all the data written before the truncate starts, and does not delete any
data after truncate completes.

(cherry picked from commit 33488ba943)
2025-08-07 09:56:49 +02:00
Botond Dénes
b09297c0b9 Merge '[Backport 2025.1] repair: postpone repair until topology is not busy ' from Scylladb[bot]
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.

Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

- (cherry picked from commit df152d9824)

- (cherry picked from commit 83c9af9670)

Parent PR: #24202

Closes scylladb/scylladb#24778

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-08-07 06:25:21 +03:00
Nikos Dragazis
22dbdafd64 test: kmip: Fix segfault from premature destruction of port_promise
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.

The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.

Fix the bug by declaring the promise before the cleanup action.

The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.

This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).

Refs #24747, #24842.
Fixes #24574.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25028
2025-08-06 11:56:59 +03:00
Aleksandra Martyniuk
02590ca8bb repair: do not keep erm in tablet_repair_task_meta
Do not keep erm in tablet_repair_task_meta to avoid non-owner shared
pointer access when metas will be distributes among shards.

Pass std::chunked_vector of erms to tablet_repair_task_impl to
preserve safety.

(cherry picked from commit 603a2dbb10)
2025-08-06 10:39:25 +02:00
Anna Stuchlik
f38b543937 doc: add the patch release upgrade procedure for version 2025.1
Fixes https://github.com/scylladb/scylladb/issues/25321

Closes scylladb/scylladb#25324
2025-08-06 11:23:37 +03:00
Aleksandra Martyniuk
f8d2f83c41 test: add test for repair and resize finalization
Add test that checks whether repair does not start if there is an
ongoing resize finalization.

(cherry picked from commit 83c9af9670)
2025-08-06 10:08:50 +02:00
Aleksandra Martyniuk
d976ab3933 repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.

(cherry picked from commit df152d9824)
2025-08-06 09:46:05 +02:00
Botond Dénes
f453b5bfa3 Merge '[Backport 2025.1] sstables: Fix quadratic space complexity in partitioned_sstable_set' from Scylladb[bot]
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

- (cherry picked from commit 494ed6b887)

- (cherry picked from commit 59dad2121f)

- (cherry picked from commit 21d1e78457)

- (cherry picked from commit c77f710a0c)

- (cherry picked from commit d5bee4c814)

Parent PR: #23806

Closes scylladb/scylladb#24012

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-08-06 09:56:43 +03:00
Michał Jadwiszczak
5c8a2784e8 storage_service, group0_state_machine: move SL cache update from topology_state_load() to load_snapshot()
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)

Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065

Closes scylladb/scylladb#25116

(cherry picked from commit 10214e13bd)

Closes scylladb/scylladb#25303
2025-08-06 09:54:24 +03:00
Nikos Dragazis
3e96b9a13d test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995

(cherry picked from commit 2656fca504)

Closes scylladb/scylladb#25297
2025-08-06 09:53:18 +03:00
Aleksandra Martyniuk
a65117617d api: storage_service: do not log the exception that is passed to user
The exceptions that are thrown by the tasks started with API are
propagated to users. Hence, there is no need to log it.

Remove the logs about exception in user started tasks.

Fixes: https://github.com/scylladb/scylladb/issues/16732.

Closes scylladb/scylladb#25153

(cherry picked from commit e607ef10cd)

Closes scylladb/scylladb#25295
2025-08-06 09:49:51 +03:00
Botond Dénes
f2a1e9f9ad Merge '[Backport 2025.1] test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot]
In this PR, we adjust tests in the cqlpy test suite so they
only use RF-rack-valid keyspaces. After that, we enable
the configuration option `rf_rack_valid_keyspaces` in the
suite by default.

Refs scylladb/scylladb#23428
Fixes scylladb/scylladb#25306

Backport: backporting to 2025.1 so we can test the option there too.

- (cherry picked from commit 6bde01bb59)

- (cherry picked from commit 958eaec056)

- (cherry picked from commit a59842257a)

- (cherry picked from commit be0877ce69)

Parent PR: #23489

Closes scylladb/scylladb#25307

* github.com:scylladb/scylladb:
  test/cqlpy: Enable rf_rack_valid_keyspaces by default
  test: Move test_alter_tablet_keyspace_rf to cluster suite
  test/cqlpy: Adjust tests to RF-rack-valid keyspaces
  test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
2025-08-06 09:49:18 +03:00
Ferenc Szili
75c1ed0e86 truncate: change check for write during truncate into a log warning
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.

The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It
also outputs to keyspace/table names, the truncated_at timepoint, the
offending replay positions which caused the check to fail.

Fixes: #25173
Fixes: #25013
(cherry picked from commit 268ec72dc9)
2025-08-06 00:51:06 +00:00
Avi Kivity
6f44cd672e Merge '[Backport 2025.1] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot]
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.

Fixes: scylladb/scylladb#24963

Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).

- (cherry picked from commit 2bb800c004)

- (cherry picked from commit 3a082d314c)

Parent PR: #25188

Closes scylladb/scylladb#25283

* github.com:scylladb/scylladb:
  test: sl: verify that legacy auth is not queried in sl to raft upgrade
  qos: don't populate effective service level cache until auth is migrated to raft
2025-08-03 15:35:32 +03:00
Dawid Mędrek
9666c255a4 test/cqlpy: Enable rf_rack_valid_keyspaces by default
All of the tests in the suite have been adjusted so they only
use RF-rack-valid keyspaces, so let's start enabling the option
by default.

(cherry picked from commit be0877ce69)
2025-08-01 21:41:40 +02:00
Dawid Mędrek
254f9b9427 test: Move test_alter_tablet_keyspace_rf to cluster suite
We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the
cluster test suite. The reason behind the change is that the test cannot
be run with `rf_rack_valid_keyspaces` turned on in the configuration.
During the test, we make the keyspace RF-rack-invalid multiple times.
Since RF-rack-validity is a very strong constraint, adjust the test
otherwise is impossible.

By moving it to the cluster test suite, we're able to change the
configuration of the node used in the test, and so the test can work
again.

(cherry picked from commit a59842257a)
2025-08-01 21:41:37 +02:00
Dawid Mędrek
1e7a6643fb test/cqlpy: Adjust tests to RF-rack-valid keyspaces
(cherry picked from commit 958eaec056)
2025-08-01 21:41:34 +02:00
Dawid Mędrek
929aed3a30 test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
We adjust three existing Cassandra tests so that they don't create
RF-rack-invalid keyspaces. We modify the replication factor used
in the problematic tests. The changes don't affect the tests as
the value of the RF is unrelated to what they verify. Thanks to
that, we can run them now even with enforced RF-rack-valid keyspaces.

The drawback is that the modified ALTER statements do not modify
the RF at all. However, since the tests seem to verify that the code
responsible for VALIDATING a request works as intended, that should
have little to no impact on them.

(cherry picked from commit 6bde01bb59)
2025-08-01 21:41:30 +02:00
Jenkins Promoter
edfdff5b1d Update pgo profiles - aarch64 2025-08-01 04:46:49 +03:00
Jenkins Promoter
706ad5baa6 Update pgo profiles - x86_64 2025-08-01 04:39:41 +03:00
Piotr Dulikowski
e5a47753ce test: sl: verify that legacy auth is not queried in sl to raft upgrade
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.

(cherry picked from commit 3a082d314c)
2025-07-31 15:12:51 +00:00
Piotr Dulikowski
afe786843f qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
2025-07-31 15:12:50 +00:00
Jakub Smolar
7044e81e2d gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050

(cherry picked from commit 6e0a063ce3)

Closes scylladb/scylladb#25138
2025-07-31 13:07:15 +03:00
Pavel Emelyanov
22147d053d Merge '[Backport 2025.1] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
Note: The simplest approach to resolving `process_request_one` merge issues, since it has been refactored, was to include the three commits from before, and then the commits that are actually being backported.

`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25240

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
  transport: remove redundant references in process_request_one
  transport: fix the indentation in process_request_one
  transport: add futures in CQL server exception handling
2025-07-31 12:20:41 +03:00
Anna Stuchlik
1ecf3fcfcd doc: add tablets support information to the Drivers table
This commit:

- Extends the Drivers support table with information on which driver supports tablets
  and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.

In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
  and replaced with tablets to correctly describe the contents of the table.

Fixes https://github.com/scylladb/scylladb/issues/19471

Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69

Closes scylladb/scylladb#24635

(cherry picked from commit 18b4d4a77c)

Closes scylladb/scylladb#25247
2025-07-31 12:20:29 +03:00
Aleksandra Martyniuk
2bd1339620 streaming: close sink when exception is thrown
If an exception is thrown in result_handling_cont in streaming,
then the sink does not get closed. This leads to a node crash.

Close sink in exception handler.

Fixes: https://github.com/scylladb/scylladb/issues/25165.

Closes scylladb/scylladb#25238

(cherry picked from commit 99ff08ae78)

Closes scylladb/scylladb#25266
2025-07-31 12:20:12 +03:00
Dario Mirovic
3e388e7910 test/cqlpy: add cpp exception metric test conditions
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`

If a change in the code introduced an exception, expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs.
In which case the test fails.

Fixes: #25273
(cherry picked from commit 4a6f71df68)
2025-07-30 21:57:25 +02:00
Dario Mirovic
e8478982dc transport/server: replace protocol_exception throws with returns
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.

In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.

Notable change is checking v.failed() before calling v.get() in
process_request function, to avoid throwing in case of an
exceptional future.

Refs: #24567
Fixes: #25273
(cherry picked from commit 5390f92afc)
2025-07-30 21:57:20 +02:00
Dario Mirovic
028de964c8 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25273
(cherry picked from commit 9f4344a435)
2025-07-30 21:57:15 +02:00
Dario Mirovic
2fbcdfd4b4 transport/server: avoid exception-throw overhead in handle_error
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).

The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continues to escape handle_error.

Refs: #24567
Fixes: #25273
(cherry picked from commit 30d424e0d3)
2025-07-30 21:57:09 +02:00
Dario Mirovic
96f5bcc5be test/cqlpy: add protocol_exception tests
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.

Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.

The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets

Refs: #24567
Fixes: #25273
(cherry picked from commit 7aaeed012e)
2025-07-30 21:56:45 +02:00
Tomasz Grabiec
24435185d7 Merge '[Backport 2025.1] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot]
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes #24807
Fixes #24925

- (cherry picked from commit ee2fa58bd6)

- (cherry picked from commit dff2b01237)

Parent PR: #24929

Closes scylladb/scylladb#25052

* github.com:scylladb/scylladb:
  streaming: Avoid deadlock by running view checks in a separate scheduling group
  service: migration_manager: Run group0 barrier in gossip scheduling group
2025-07-30 02:22:51 +02:00
Tomasz Grabiec
2a9eecdb65 streaming: Avoid deadlock by running view checks in a separate scheduling group
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes: #24807
(cherry picked from commit dff2b01237)
2025-07-28 21:47:50 +02:00
Andrzej Jackowski
d6fa9c95c4 transport: remove redundant references in process_request_one
The references were added and used in previous commits to
limit the number of line changes for a reviewer convenience.

This commit removes the redundant references to make the code
more clear and concise.

(cherry picked from commit 9b1f062827)
2025-07-28 20:45:37 +02:00
Andrzej Jackowski
e8dbb82949 transport: fix the indentation in process_request_one
Fix the indentation after the previous commit that intentionally had
a wrong indent to limit the number of changed lines

(cherry picked from commit 9c0f369cf8)
2025-07-28 20:45:29 +02:00
Andrzej Jackowski
aed0a71efe transport: add futures in CQL server exception handling
Prepare for the next commit that will introduce a
seastar::sleep in handling of selected exception.

This commit:
 - Rewrite cql_server::connection::process_request_one to use
   seastar::futurize_invoke and try_catch<> instead of
   utils::result_try.
 - The intentation is intentionally incorrect to reduce the
   number of changed lines. Next commits fix it.

(cherry picked from commit 8a7454cf3e)
2025-07-28 20:45:18 +02:00
Aleksandra Martyniuk
23479939b9 tasks: do not use binary progress for task manager tasks
Currently, progress of a parent task depends on expected_total_workload,
expected_children_number, and children progresses. Basically, if total
workload is known or all children have already been created, progresses
of children are summed up. Otherwise binary progress is returned.

As a result, two tasks of the same type may return progress in different
units. If they are children of the same task and this parent gathers the
progress - it becomes meaningless.

Drop expected_children_number as we can't assume that children are able
to show their progresses.

Modify get_progress method - progress is calculated based on children
progresses. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.

Fixes: https://github.com/scylladb/scylladb/issues/24650.

Closes scylladb/scylladb#25113

(cherry picked from commit a7ee2bbbd8)

Closes scylladb/scylladb#25197
2025-07-28 09:23:57 +03:00
Aleksandra Martyniuk
3f93cdc61b nodetool: repair: skip tablet keyspaces
Currently, nodetool repair command repairs both vnode and tablet keyspaces
if no keyspace is specified. We should use this command to repair
only vnode keyspaces, but this isn't easily accessible - we have to
explicitly run repair only on vnode keyspaces.

nodetool repair skips tablet keyspaces unless a tablet keyspace
is explicitely passed as an argument.

Fixes: #24040.

Closes scylladb/scylladb#24042

(cherry picked from commit 6f8b378e80)

Closes scylladb/scylladb#25152
2025-07-24 16:32:37 +03:00
Piotr Dulikowski
8298405805 auth: fix crash when migration code runs parallel with raft upgrade
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).

The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.

While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.

Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.

Backport note: the bugfix part for password_authenticator was removed
from the commit because 2025.1 does not have scylladb/scylladb#22532 and
thus does not contain the bug.

Fixes: scylladb/scylladb#24975

Closes scylladb/scylladb#24976

(cherry picked from commit a14b7f71fe)

Closes scylladb/scylladb#25017
2025-07-24 09:16:18 +02:00
Calle Wilund
e53d5ed0dc utils::http::dns_connection_factory: Use a shared certificate_credentials
Fixes #24447

This factory type, which is really more a data holder/connection producer
per connection instance, creates, if using https, a new certificate_credentials
on every instance. Which when used by S3 client is per client and
scheduling groups.

Which eventually means that we will do a set_system_trust + "cold" handshake
for every tls connection created this way.

This will cause both IO and cold/expensive certificate checking -> possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system", it could very well be shared across the shard.

This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.

Closes scylladb/scylladb#24448

(cherry picked from commit 80feb8b676)

Closes scylladb/scylladb#24460
2025-07-22 11:02:59 +03:00
Michael Litvak
a57c51b9d7 tablets: stop storage group on deallocation
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.

Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.

However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
   allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
   stopped.

To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.

Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828

Closes scylladb/scylladb#24896

(cherry picked from commit fa24fd7cc3)

Closes scylladb/scylladb#24906
2025-07-22 11:02:28 +03:00
Pavel Emelyanov
57b24383ed Merge '[Backport 2025.1] Improve background disposal of tablet_metadata' from Scylladb[bot]
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.

This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into vectors, per- owner shard.

Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.

Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.

Fixes #24814
Refs #23284

This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.

* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards

- (cherry picked from commit 3acca0aa63)

- (cherry picked from commit 493a2303da)

- (cherry picked from commit e0a19b981a)

- (cherry picked from commit 2b2cfaba6e)

- (cherry picked from commit 2c0bafb934)

- (cherry picked from commit 4a3d14a031)

- (cherry picked from commit 6e4803a750)

Parent PR: #24618

Closes scylladb/scylladb#24862

* github.com:scylladb/scylladb:
  token_metadata_impl: clear_gently: release version tracker early
  test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables
  token_metadata: clear_and_destroy_impl when destroyed
  token_metadata: keep a reference to shared_token_metadata
  token_metadata: move make_token_metadata_ptr into shared_token_metadata class
  replica: database: get and expose a mutable locator::shared_token_metadata
  locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
2025-07-22 11:02:00 +03:00
Aleksandra Martyniuk
5f07f9963d replica: hold compaction group gate during flush
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.

Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.

Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.

Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.

Fixes: #23911.

Closes scylladb/scylladb#24582

(cherry picked from commit 2ec54d4f1a)

Closes scylladb/scylladb#24950
2025-07-22 11:01:37 +03:00
Patryk Jędrzejczak
f11413274d test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
  responding before it applied mutations from any of the writes.

We fix the test by not expecting reads with CL=ONE to return a row.

We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.

Fixes scylladb/scylladb#22967

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#23518

(cherry picked from commit 21edec1ace)

Fixing conflicts required additionally backporting the log line
from scylladb/scylladb#22968.

Closes scylladb/scylladb#24983
2025-07-22 11:01:18 +03:00
Pavel Emelyanov
fb20a59242 Merge '[Backport 2025.1] cdc: throw error if column doesn't exist' from Scylladb[bot]
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes https://github.com/scylladb/scylladb/issues/24952

backport needed - simple fix for a node crash

- (cherry picked from commit b336f282ae)

- (cherry picked from commit 86dfa6324f)

Parent PR: #24986

Closes scylladb/scylladb#25065

* github.com:scylladb/scylladb:
  test: cdc: add test_cdc_with_alter
  cdc: throw error if column doesn't exist
2025-07-22 11:00:32 +03:00
Pavel Emelyanov
7cb673b6c2 Merge '[Backport 2025.1] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot]
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

- (cherry picked from commit 20d0050f4e)

- (cherry picked from commit 59800b1d66)

Parent PR: #25008

Closes scylladb/scylladb#25106

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-22 11:00:18 +03:00
Dawid Mędrek
739dbf0f7d cdc: Forbid altering columns of inactive CDC log table
When CDC becomes disabled on the base table, the CDC log table
still exsits (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old olg table will be *re-attached*.

Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.

Validation tests are provided.

(cherry picked from commit 59800b1d66)
2025-07-21 11:42:38 +00:00
Dawid Mędrek
02108573b6 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:42:38 +00:00
Benny Halevy
5592d66d32 token_metadata_impl: clear_gently: release version tracker early
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited for in barriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6e4803a750)
2025-07-21 10:23:33 +03:00
Benny Halevy
7414cf1327 test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables
Reproduces #23284

Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-21 10:23:31 +03:00
Benny Halevy
d70ed984cb token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
2025-07-21 10:02:19 +03:00
Benny Halevy
e24bfb0bf3 token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
2025-07-21 10:00:49 +03:00
Benny Halevy
aa8e65616e token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
2025-07-21 10:00:49 +03:00
Benny Halevy
a3025520d2 replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for next patch, the will use this shared_token_metadata
to make mutable_token_metadata_ptr:s

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 493a2303da)
2025-07-21 09:59:17 +03:00
Benny Halevy
2d55d417b9 locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3acca0aa63)
2025-07-21 09:51:08 +03:00
Michael Litvak
e54414ab28 test: cdc: add test_cdc_with_alter
Add a test that tests adding and dropping a column to a table with CDC
enabled while writing to it.

(cherry picked from commit 86dfa6324f)
2025-07-20 10:20:24 +02:00
Michael Litvak
9e76902da1 cdc: throw error if column doesn't exist
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

(cherry picked from commit b336f282ae)
2025-07-18 10:35:29 +00:00
Tomasz Grabiec
434ecdee0e service: migration_manager: Run group0 barrier in gossip scheduling group
Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925

(cherry picked from commit ee2fa58bd6)
2025-07-17 17:24:33 +00:00
Jenkins Promoter
74b1e40502 Update ScyllaDB version to: 2025.1.6 2025-07-17 15:58:23 +03:00
Avi Kivity
36d2f80f38 Merge '[Backport 2025.1] storage_service: Use utils::chunked_vector to avoid big allocation' from Scylladb[bot]
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

- (cherry picked from commit c5a136c3b5)

Parent PR: #24561

Closes scylladb/scylladb#24890

* github.com:scylladb/scylladb:
  storage_service: Use utils::chunked_vector to avoid big allocation
  utils: chunked_vector: implement erase() for single elements and ranges
  utils: chunked_vector: implement insert() for single-element inserts
2025-07-16 17:28:36 +03:00
Asias He
478d02ce83 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561

(cherry picked from commit c5a136c3b5)
2025-07-16 15:39:51 +08:00
Piotr Dulikowski
451cf275bf Merge '[Backport 2025.1] Fix for cassandra role gets recreated after DROP ROLE' from Scylladb[bot]
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.

Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.

If server is started without cluster quorum we'll skip creating role or password.

Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2

- (cherry picked from commit 68fc4c6d61)

- (cherry picked from commit c96c5bfef5)

- (cherry picked from commit 2e2ba84e94)

- (cherry picked from commit f85d73d405)

- (cherry picked from commit d9ec746c6d)

- (cherry picked from commit a3bb679f49)

- (cherry picked from commit 67a4bfc152)

- (cherry picked from commit 0ffddce636)

- (cherry picked from commit 5e7ac34822)

Parent PR: #24451

Closes scylladb/scylladb#24693

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for password reset procedure
  auth: cache roles table scan during startup
  test: auth_cluster: add test for replacing default superuser
  test: pylib: add ability to specify default authenticator during server_start
  test: pylib: allow rolling restart without waiting for cql
  auth: split auth-v2 logic for adding default superuser password
  auth: split auth-v2 logic for adding default superuser role
  auth: ldap: fix waiting for underlying role manager
  auth: wait for default role creation before starting authorizer and authenticator
2025-07-16 08:08:49 +02:00
Avi Kivity
60135c8d6c utils: chunked_vector: implement erase() for single elements and ranges
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.

Again we defer optimization for trivially copyable types.

Unit tests are added.

Needed for range_streamer with token_ranges using chunked_vector.

(cherry picked from commit d6eefce145)
2025-07-16 07:38:55 +08:00
Avi Kivity
efad391fac utils: chunked_vector: implement insert() for single-element inserts
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).

Implement using push_back() and std::rotate().

emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).

The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster that std::rotate(),
but this complex optimization is left for later.

Unit tests are added.

(cherry picked from commit 5301f3d0b5)
2025-07-16 07:38:55 +08:00
Botond Dénes
9258d0e1cf test/cluster/test_read_repair: write 100 rows in trace test
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).

Fixes: #24330

Closes scylladb/scylladb#24403

(cherry picked from commit 495f607e73)

Closes scylladb/scylladb#24972
2025-07-15 20:16:52 +03:00
Yaron Kaikov
83baf94bea dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916

(cherry picked from commit fdcaa9a7e7)

Closes scylladb/scylladb#24966
2025-07-15 11:09:16 +02:00
Yaron Kaikov
a20e56230b auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954

(cherry picked from commit ed7c7784e4)

Closes scylladb/scylladb#24965
2025-07-15 10:54:55 +02:00
Aleksandra Martyniuk
23bcff5c52 repair: Reduce max row buf size when small table optimization is on
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.

On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.

In large clusters, repair with small_table_optimization leads to OOM.

Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.

Use max_row_buf_size to calculate number of units taken from mem_sem.

Fixes: https://github.com/scylladb/scylladb/issues/22244.

Closes scylladb/scylladb#24868

(cherry picked from commit 17272c2f3b)

Closes scylladb/scylladb#24904
2025-07-15 09:38:16 +03:00
Jenkins Promoter
8c860ad264 Update pgo profiles - aarch64 2025-07-15 04:37:19 +03:00
Jenkins Promoter
e24f80aef7 Update pgo profiles - x86_64 2025-07-15 04:36:48 +03:00
Raphael S. Carvalho
c307d84925 test: fix flakiness of test_missing_data
2025.1 only is susceptible. Merge has slightly different logic in
master, test had to be adjusted for 2025.1 but is flaky.
Can happen two successive merges cause the merge waiting to never
finish.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes scylladb/scylladb#24821

Closes scylladb/scylladb#24936
2025-07-14 14:28:05 +03:00
Gleb Natapov
ce492fa5d0 api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911

(cherry picked from commit 89f2edf308)

Closes scylladb/scylladb#24921
2025-07-14 11:44:44 +02:00
Lakshmi Narayanan Sreethar
46dfe09e64 db/corrupt_data_handler: guard stop() against null _fragment_semaphore
The `system_table_corrupt_data_handler::_fragment_semaphore` member is
initialized only when the `system_keyspace` sharded service is
initialized by `main`. If the server shuts down before that due to an
unrelated reason, `_fragment_semaphore` remains default-initialized to
`nullptr`. When the shutdown process later attempts to call `stop()` on
`system_table_corrupt_data_handler`, it tries to call `stop()` on
`_fragment_semaphore`, leading to a segfault.

Fix this by checking if `_fragment_semaphore` is null before invoking
`stop()` on it.

Although `corrupt_data_handler` was backported to 2025.1, this issue
does not occur in 2025.2 and master. The recent versions include #23113,
which changes how the system keyspace is stopped and PR #24492, which
originally introduced `corrupt_data_handler`, builds on that change to
ensure `_fragment_semaphore` is stopped only if it has been created.

Fixes #24920

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24931
2025-07-14 12:06:29 +03:00
Avi Kivity
477cad246f tools: optimized_clang: make it work in the presence of a scylladb profile
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.

The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.

To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.

Fixes #22713

Closes scylladb/scylladb#24571

(cherry picked from commit 52f11e140f)

Closes scylladb/scylladb#24620
2025-07-13 21:43:49 +03:00
Raphael S. Carvalho
accc42d264 test: fix flakiness of test_missing_data
2025.1 only is susceptible. Merge has slightly different logic in
master, test had to be adjusted for 2025.1 but is flaky.
Can happen two successive merges cause the merge waiting to never
finish.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-07-12 16:05:48 -03:00
Raphael S. Carvalho
ef8a2bbc88 test: Verify partitioned set store split and unsplit correctly
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit d5bee4c814)
2025-07-11 10:05:34 -03:00
Raphael S. Carvalho
63bdbebdef sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c77f710a0c)
2025-07-11 10:05:30 -03:00
Raphael S. Carvalho
c633bb84ac compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 21d1e78457)
2025-07-11 09:21:40 -03:00
Raphael S. Carvalho
be71400fb2 compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 59dad2121f)
2025-07-11 09:21:40 -03:00
Raphael S. Carvalho
35e0b1ac9b dht: Add overlap_ratio() for token range
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 494ed6b887)
2025-07-11 09:21:40 -03:00
Ferenc Szili
3e3147c03a logging: Add row count to large partition warning message
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can lead to
confusion because in cases where the warning was written because of the
row count being larger than the threshold, but the partition size is below
the threshold, the warning will only contain the partition size in bytes,
leading the user to believe the warning was output because of the
partition size, when in reality it was the row count that triggered the
warning. See #20125

This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.

The warning for large partitions and row counts will now look like this:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

Closes scylladb/scylladb#22010

(cherry picked from commit 96267960f8)

Closes scylladb/scylladb#24681
2025-07-10 16:20:23 +02:00
Marcin Maliszkiewicz
0bbc701863 test: auth_cluster: add test for password reset procedure
(cherry picked from commit aef531077b)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
52dfc23f9e auth: cache roles table scan during startup
It may be particularly beneficial during connection
storms on startup. In such cases, it can happen that
none of the user's read requests succeed, preventing
the cache from being populated. This, in turn, makes
it more difficult for subsequent reads to
succeed, reducing resiliency against such storms.

(cherry picked from commit 887c57098e)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
b91eee6103 test: auth_cluster: add test for replacing default superuser
This test demonstrates creating custom superuser guide:
https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html

(cherry picked from commit d9223b61a2)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
a10e17106b test: pylib: add ability to specify default authenticator during server_start
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.

(cherry picked from commit 08bf7237f066cead133bf0cac9bba215f238070a)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
b79415a072 test: pylib: allow rolling restart without waiting for cql
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).

(cherry picked from commit d9ec746c6d)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
0665f9a12b auth: split auth-v2 logic for adding default superuser password
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
- old code may be less tested now so it's best to not change it
- new code path avoids quorum selects in a typical flow (passwords set)

There may be a case when user deletes a superuser or password
right before restarting a node, in such case we may ommit
updating a password but:
- this is a trade-off between quorum reads on startup
- it's far more important to not update password when it shouldn't be
- if needed password will be updated on next node restart

If there is no quorum on startup we'll skip creating password
because we can't perform any raft operation.

Additionally this fixes a problem when password is created despite
having non default superuser in auth-v2.

(cherry picked from commit f85d73d405)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
b916f51df1 auth: split auth-v2 logic for adding default superuser role
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
  - old code may be less tested now so it's best to not change it
  - new code path avoids quorum selects in a typical flow (roles set)

This fixes a problem when superuser role is created despite
having non default superuser in auth-v2.

If there is no quorum on startup we'll skip creating role
because we can't perform any raft operation.

(cherry picked from commit 2e2ba84e94)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
f5bf479ae7 auth: ldap: fix waiting for underlying role manager
ldap_role_manager depends on standard_role_manager,
therefore it needs to wait for superuser initialization.
If this is missing, the password authenticator will start
checking the default password too early and may fail to
create the default password if there is no default
role yet.

Currently password authenticator will create password
together with the role in such case but in following
commits we want to separate those responsibilities correctly.

(cherry picked from commit c96c5bfef5)
2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz
015d0f846f auth: wait for default role creation before starting authorizer and authenticator
There is a hidden dependency: the creation of the default superuser role
is split between the password authenticator and the role manager.
To work correctly, they must start in the right order: role manager first,
then password authenticator.

(cherry picked from commit 68fc4c6d61)
2025-07-10 11:19:09 +02:00
Jenkins Promoter
858d9c34d5 Update ScyllaDB version to: 2025.1.5 2025-07-09 21:09:32 +03:00
Piotr Dulikowski
e59374b721 Merge '[Backport 2025.1] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot]
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

- (cherry picked from commit 8d48b27062)

- (cherry picked from commit fc5ba4a1ea)

- (cherry picked from commit 7150632cf2)

- (cherry picked from commit 74a3fa9671)

- (cherry picked from commit a9b476e057)

- (cherry picked from commit d7af26a437)

Parent PR: #24595

Closes scylladb/scylladb#24878

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-09 17:23:26 +02:00
Raphael S. Carvalho
f926083fbd replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426

(cherry picked from commit 2d716f3ffe)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24875
2025-07-09 17:39:19 +03:00
Patryk Jędrzejczak
7365eb2c70 Merge '[Backport 2025.1] Make it easier to debug stuck raft topology operation.' from Scylladb[bot]
The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations.

Backport since we want to have in the production as quick as possible.

Fixes #24860

- (cherry picked from commit c8ce9d1c60)

- (cherry picked from commit 4e6369f35b)

Parent PR: #24799

Closes scylladb/scylladb#24877

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-09 13:11:33 +02:00
Michael Litvak
ae7b0838f6 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.

(cherry picked from commit d7af26a437)
2025-07-08 12:32:26 +03:00
Michael Litvak
6d45cb3d5c test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 12:32:26 +03:00
Gleb Natapov
a9e4938ec6 topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls a relatively rare and the output may help to analyze issues
in production.

(cherry picked from commit 4e6369f35b)
2025-07-08 11:56:36 +03:00
Gleb Natapov
4b20a06996 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.

(cherry picked from commit c8ce9d1c60)
2025-07-08 11:56:30 +03:00
Michael Litvak
0e95704df1 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:24:30 +00:00
Michael Litvak
9199c15813 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:24:30 +00:00
Michael Litvak
d161a0bb35 batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:24:29 +00:00
Michael Litvak
d41cdc808a storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows to indicate the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.

(cherry picked from commit 8d48b27062)
2025-07-08 06:24:29 +00:00
Avi Kivity
20afd27765 storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811

(cherry picked from commit 60f407bff4)

Closes scylladb/scylladb#24844
2025-07-05 00:38:00 +03:00
Botond Dénes
99ed32b178 Merge '[Backport 2025.1] sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Scylladb[bot]
Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows for aborting when on the error, to generate a coredump.

Fixes https://github.com/scylladb/scylladb/issues/20845

We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables bringing down nodes in the field, should be backported to all releases, so we don't hit this in the future

- (cherry picked from commit 27e26ed93f)

- (cherry picked from commit bce89c0f5e)

Parent PR: #24534

Closes scylladb/scylladb#24683

* github.com:scylladb/scylladb:
  sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
  sstables/exceptions: introduce parse_assert()
2025-07-04 08:46:21 +03:00
Botond Dénes
4a0e61daf0 sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.

(cherry picked from commit bce89c0f5e)
2025-07-03 09:53:50 +03:00
Botond Dénes
62b1f1a581 sstables/exceptions: introduce parse_assert()
To replace SCYLLA_ASSERT on the read/parse path. SSTables can get
corrupt for various reasons, some outside of the database's control. A
bad SSTable should not bring down the database, the parsing should
simply be aborted, with as much information printed as possible for the
investigation of the nature of the corruption.
The newly introduced parse_assert() uses on_internal_error() under the
hood, which prints a backtrace and optionally allows for aborting when
on the error, to generate a coredump.

(cherry picked from commit 27e26ed93f)
2025-07-03 09:53:50 +03:00
Botond Dénes
ffcd772a92 Merge '[Backport 2025.1] sstables/mx/writer: handle non-full prefix row keys' from Scylladb[bot]
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

- (cherry picked from commit 92b5fe8983)

- (cherry picked from commit 0753643606)

- (cherry picked from commit b0d5462440)

- (cherry picked from commit 093d4f8d69)

- (cherry picked from commit 678deece88)

- (cherry picked from commit 64f8500367)

- (cherry picked from commit b931145a26)

- (cherry picked from commit 3e1c50e9a7)

- (cherry picked from commit 46ff7f9c12)

- (cherry picked from commit ebd9420687)

- (cherry picked from commit aae212a87c)

- (cherry picked from commit 592ca789e2)

- (cherry picked from commit edc2906892)

Parent PR: #24492

Closes scylladb/scylladb#24740

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-07-03 07:20:07 +03:00
Botond Dénes
872ed2b359 test/boost/sstable_datafile_test: add test for corrupt data
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data

(cherry picked from commit edc2906892)
2025-07-02 14:04:11 +03:00
Botond Dénes
9affca795c sstables/mx/writer: handler rows with empty keys
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.

Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.

(cherry picked from commit 592ca789e2)
2025-07-02 14:04:11 +03:00
Botond Dénes
db07141cea test/lib/cql_assertions: introduce columns_assertions
To enable targeted and optionally typed assertions against individual
columns in a row.

(cherry picked from commit aae212a87c)
2025-07-02 14:04:11 +03:00
Botond Dénes
00402cb4c5 sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.

(cherry picked from commit ebd9420687)
2025-07-02 14:04:09 +03:00
Botond Dénes
135727fd07 tools/scylla-sstable: make large_data_handler a local
No reason for it to be a global, not even convenience.

(cherry picked from commit 46ff7f9c12)
2025-07-02 14:03:03 +03:00
Botond Dénes
820b6a3a3f db: introduce corrupt_data_handler
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saved corrupt data in
  system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data

(cherry picked from commit 3e1c50e9a7)
2025-07-02 13:57:27 +03:00
Michał Chojnowski
0cf3a1796b utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24776
2025-07-02 12:17:58 +03:00
Botond Dénes
c39e395776 docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763

(cherry picked from commit 37ef9efb4e)

Closes scylladb/scylladb#24779
2025-07-02 12:14:31 +03:00
Avi Kivity
853bd8794b repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.

Fixes #24725.

Closes scylladb/scylladb#24726

(cherry picked from commit 6aa71205d8)

Closes scylladb/scylladb#24766
2025-07-02 12:09:23 +03:00
Botond Dénes
1ad5214bdd Merge '[Backport 2025.1] mutation: check key of inserted rows' from Scylladb[bot]
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

- (cherry picked from commit 8b756ea837)

- (cherry picked from commit ab96c703ff)

Parent PR: #24497

Closes scylladb/scylladb#24739

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-07-02 12:03:29 +03:00
Lakshmi Narayanan Sreethar
99ec69e27d utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640

(cherry picked from commit 279253ffd0)

Closes scylladb/scylladb#24691
2025-07-02 11:34:38 +03:00
Szymon Malewski
1d78f17697 utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue, where in some cases a timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619

(cherry picked from commit f28bab741d)

Closes scylladb/scylladb#24684
2025-07-02 11:30:26 +03:00
Botond Dénes
864ba576c5 Merge '[Backport 2025.1] tablets: fix missing data after tablet merge ' from Scylladb[bot]
Consider the following scenario:

1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:

1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]

With cache:

1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read.

This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.

Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.

Fixes: https://github.com/scylladb/scylladb/issues/23313

This change needs to be backported to all supported versions which implement tablet merge.

- (cherry picked from commit d0329ca370)

- (cherry picked from commit 1f9f724441)

- (cherry picked from commit 53df911145)

Parent PR: #24287

Closes scylladb/scylladb#24338

* github.com:scylladb/scylladb:
  replica: Fix range reads spanning sibling tablets
  test: add reproducer and test for mutation source refresh after merge
  tablets: trigger mutation source refresh on tablet count change
2025-07-02 11:28:35 +03:00
Gleb Natapov
583d33ac8c storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651

(cherry picked from commit 5f953eb092)

Closes scylladb/scylladb#24701
2025-07-02 10:18:56 +02:00
Botond Dénes
3c5240064b mutation: introduce frozen_mutation_fragment_v2
Mirrors frozen_mutation_fragment and shares most of the underlying
serialization code, the only exception is replacing range_tombstone with
range_tombstone_change in the mutation fragment variant.

(cherry picked from commit b931145a26)
2025-07-01 15:37:02 +00:00
Botond Dénes
eb90ce1c8b mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
Instead of mutation_fragment, let caller convert into mutation_fragment.
Allows reuse in future callers which will want to convert to
mutation_fragment_v2.

(cherry picked from commit 64f8500367)
2025-07-01 15:37:02 +00:00
Botond Dénes
c6db404527 mutation/mutation_partition_view: extract de-ser of {clustering,static} row
From the visitor in frozen_mutation_fragment::unfreeze(). We will want
to re-use it in the future frozen_mutation_fragment_v2::unfreeze().

Code-movement only, the code is not changed.

(cherry picked from commit 678deece88)
2025-07-01 15:37:02 +00:00
Botond Dénes
6107f0ed26 idl-compiler.py: generate skip() definition for enums serializers
Currently they only have the declaration and so far they got away with
it, looks like no users exists, but this is about to change so generate
the definition too.

(cherry picked from commit 093d4f8d69)
2025-07-01 15:37:02 +00:00
Botond Dénes
1b23057f7a idl: extract full_position.idl from position_in_partition.idl
A future user of position_in_partition.idl doesn't need full_position
and so doesn't want to include full_position.hh to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user in a
storage_proxy VERB.

(cherry picked from commit b0d5462440)
2025-07-01 15:37:01 +00:00
Botond Dénes
16d039a04f db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.

(cherry picked from commit 0753643606)
2025-07-01 15:37:01 +00:00
Botond Dénes
85c3f12039 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.

(cherry picked from commit 92b5fe8983)
2025-07-01 15:37:01 +00:00
Jenkins Promoter
33c79319d2 Update pgo profiles - aarch64 2025-07-01 04:32:49 +03:00
Jenkins Promoter
c50093a5dc Update pgo profiles - x86_64 2025-07-01 04:11:16 +03:00
Botond Dénes
76efa77466 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638

(cherry picked from commit ee6d7c6ad9)

Closes scylladb/scylladb#24741
2025-06-30 20:15:07 +03:00
Anna Stuchlik
acbe88172a doc: remove OSS mention from the SI notes
This commit removes a confusing reference to an Open Source version
form the Local Secondary Indexes page.

Fixes https://github.com/scylladb/scylladb/issues/24668

Closes scylladb/scylladb#24673

(cherry picked from commit 2367330513)

Closes scylladb/scylladb#24722
2025-06-30 18:54:29 +03:00
Raphael S. Carvalho
52ae6c2aba replica: Fix range reads spanning sibling tablets
We don't guarantee that coordinators will only emit range reads that
span only one tablet.

Consider this scenario:

1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.

We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
2025-06-30 10:50:59 -03:00
Ferenc Szili
b0140cb646 test: add reproducer and test for mutation source refresh after merge
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.

(cherry picked from commit 1f9f724441)
2025-06-30 10:50:53 -03:00
Botond Dénes
d37efb0c08 mutation: check key of inserted rows
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.

The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.

Fixes: https://github.com/scylladb/scylladb/issues/24506
(cherry picked from commit ab96c703ff)
2025-06-30 12:43:04 +00:00
Botond Dénes
3af0561078 compound: optimize is_full() for single-component types
For such compounds, unserializing the key is not necessary to determine
whether the key is full or not.

(cherry picked from commit 8b756ea837)
2025-06-30 12:43:03 +00:00
Abhinav Jha
2b43fb9841 group0: modify start_operation logic to account for synchronize phase race condition
In the present scenario, the bootstrapping node undergoes synchronize phase after
initialization of group0, then enters post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues stream
ranges command as soon as the node successfully completes `join_group0`. Although for
a node booting into an already upgraded cluster, the time duration for which, node
remains in synchronize phase is negligible but this race condition causes trouble in a
small percentage of cases, since the stream ranges operation fails and node fails to bootstrap.

This commit addresses this issue and updates the error throw logic to account for this
edge case and lets the node wait (with timeouts) for synchronize phase to get over instead of throwing
error.

A regression test is also added to confirm the working of this code change. The test adds a
wait in synchronize phase for newly joining node and releases only after the program counter
reaches the synchronize case in the `start_operation` function. Hence it indicates that in the
updated code, the start_operation will wait for the node to get done with the
synchronize phase instead of throwing error.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23536

Closes scylladb/scylladb#23829

(cherry picked from commit 5ff693eff6)

Closes scylladb/scylladb#24627
2025-06-29 14:33:01 +03:00
Avi Kivity
837f3eb6c2 Merge '[Backport 2025.1] main: don't start maintenance auth service if not enabled' from Scylladb[bot]
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1

- (cherry picked from commit 97c60b8153)

- (cherry picked from commit dd01852341)

Parent PR: #24527

Closes scylladb/scylladb#24569

* github.com:scylladb/scylladb:
  test: add test for live updates of permissions cache config
  main: don't start maintenance auth service if not enabled
2025-06-29 14:32:35 +03:00
Aleksandra Martyniuk
b7cb4dd413 test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24654
2025-06-28 09:39:52 +03:00
Marcin Maliszkiewicz
d27150e1ef test: add test for live updates of permissions cache config
(cherry picked from commit dd01852341)
2025-06-27 16:06:03 +02:00
Marcin Maliszkiewicz
58446fba62 main: don't start maintenance auth service if not enabled
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

(cherry picked from commit 97c60b8153)
2025-06-27 16:06:03 +02:00
Gleb Natapov
be1fb59806 api: return error from get_host_id_map if gossiper is not enabled yet.
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.

Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.

Fixes: #24479

Closes scylladb/scylladb#24575

(cherry picked from commit e364995e28)

Closes scylladb/scylladb#24584
2025-06-24 10:05:15 +03:00
Anna Stuchlik
987c57c889 doc: improve the tablets limitations section
This PR improves the Limitations and Unsupported Features section
for tablets, as it has been confusing to the customers.

Refs https://github.com/scylladb/scylla-enterprise/issues/5465

Fixes https://github.com/scylladb/scylladb/issues/24562

Closes scylladb/scylladb#24563

(cherry picked from commit 17eabbe712)

Closes scylladb/scylladb#24586
2025-06-24 10:05:01 +03:00
Pavel Emelyanov
d431d8a99c Merge '[Backport 2025.1] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24602

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-24 10:03:50 +03:00
Piotr Dulikowski
810ce1c67a Merge '[Backport 2025.1] test/cluster: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot]
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.

Note that this PR does not cover all of the cluster tests. There are few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.

I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.

Fixes scylladb/scylladb#23959

Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.

- (cherry picked from commit dbb8835fdf)

- (cherry picked from commit 9281bff0e3)

- (cherry picked from commit 5b83304b38)

- (cherry picked from commit 73b22d4f6b)

- (cherry picked from commit 2882b7e48a)

- (cherry picked from commit 4c46551c6b)

- (cherry picked from commit 92f7d5bf10)

- (cherry picked from commit 5d1bb8ebc5)

- (cherry picked from commit d3c0cd6d9d)

- (cherry picked from commit 04567c28a3)

- (cherry picked from commit c8c28dae92)

- (cherry picked from commit c4b32c38a3)

- (cherry picked from commit ee96f8dcfc)

Parent PR: #23661

Closes scylladb/scylladb#24120

* github.com:scylladb/scylladb:
  test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites
  test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests
  test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
  test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
  test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
  test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
  test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
  test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
  test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
  test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity
  test/topology_custom/test_multidc.py: Adjust to RF-rack-validity
  test/object_store/test_backup.py: Adjust to RF-rack-validity
  test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity
2025-06-23 09:05:46 +02:00
Michał Chojnowski
dbb3f5ab17 memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

(cherry picked from commit 975e7e405a)
2025-06-22 17:37:47 +00:00
Michał Chojnowski
b2302d8a46 replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.

(cherry picked from commit 7d551f99be)
2025-06-22 17:37:47 +00:00
Pavel Emelyanov
b6a4065ac1 Update seastar submodule (no nested stall backtraces)
* seastar ed31c1ce...b2ff08a5 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24540
2025-06-19 10:07:45 +03:00
Dawid Mędrek
5a41282bb3 test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites
Almost all of the tests have been adjusted to be able to be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.

(cherry picked from commit ee96f8dcfc)
2025-06-18 16:47:49 +02:00
Dawid Mędrek
6506eddcfd test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.

(cherry picked from commit c4b32c38a3)
2025-06-18 14:21:53 +02:00
Dawid Mędrek
a63f1737d1 test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.

(cherry picked from commit c8c28dae92)
2025-06-18 14:21:51 +02:00
Dawid Mędrek
720c80239d test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.

(cherry picked from commit 04567c28a3)
2025-06-18 14:21:48 +02:00
Dawid Mędrek
dd7da8a9b9 test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.

The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.

(cherry picked from commit d3c0cd6d9d)
2025-06-18 14:21:46 +02:00
Dawid Mędrek
05fffbdea0 test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of  RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.

(cherry picked from commit 5d1bb8ebc5)
2025-06-18 14:21:44 +02:00
Dawid Mędrek
b0e60c2f90 test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.

(cherry picked from commit 92f7d5bf10)
2025-06-18 14:21:42 +02:00
Dawid Mędrek
358cb2f82f test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
We simply assign the nodes used in the test to seprate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.

(cherry picked from commit 4c46551c6b)
2025-06-18 14:21:40 +02:00
Dawid Mędrek
50793d1109 test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.

Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.

(cherry picked from commit 2882b7e48a)
2025-06-18 14:21:37 +02:00
Dawid Mędrek
822a131d30 test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.

(cherry picked from commit 73b22d4f6b)
2025-06-18 14:21:35 +02:00
Dawid Mędrek
8da3f7003c test/topology_custom/test_multidc.py: Adjust to RF-rack-validity
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.

(cherry picked from commit 5b83304b38)
2025-06-18 14:21:33 +02:00
Dawid Mędrek
60bc1e6c9d test/object_store/test_backup.py: Adjust to RF-rack-validity
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.

Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.

(cherry picked from commit 9281bff0e3)
2025-06-18 14:21:31 +02:00
Dawid Mędrek
ebc1584823 test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:

* Using `pytest.mark.prepare_3_racks_cluster` instead of
  `pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
  `ManagerClient::servers_add()`.

In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.

(cherry picked from commit dbb8835fdf)
2025-06-18 14:21:28 +02:00
Michał Chojnowski
25cb5bc724 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543

(cherry picked from commit 27f66fb110)

Closes scylladb/scylladb#24547
2025-06-18 14:29:33 +03:00
Pavel Emelyanov
10ad2d9460 Merge '[Backport 2025.1] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24487

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-18 14:29:04 +03:00
Anna Stuchlik
607d39609d doc: add support for z3 GCP
This commit adds support for z3-highmem-highlssd instance types to
Cloud Instance Recommendations for GCP.

Fixes https://github.com/scylladb/scylladb/issues/24511

Closes scylladb/scylladb#24533

(cherry picked from commit 648d8caf27)

Closes scylladb/scylladb#24544
2025-06-17 18:26:34 +03:00
Michael Litvak
bc8289da8c test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-17 18:11:26 +03:00
Michael Litvak
7f3883d072 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-17 18:11:26 +03:00
Michael Litvak
c45d465100 tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-17 18:11:26 +03:00
Anna Stuchlik
6592334520 doc: remove the limitation for disabling CDC
This commit removes the instruction to stop all writes before disabling CDC with ALTER.

Fixes https://github.com/scylladb/scylla-docs/issues/4020

Closes scylladb/scylladb#24406

(cherry picked from commit b0ced64c88)

Closes scylladb/scylladb#24475
2025-06-17 13:15:49 +03:00
Andrzej Jackowski
fa1244c2ca test: wait for normal state propagation in test_auth_v2_migration
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.

In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.

To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.

Refs: #24163

Closes scylladb/scylladb#24185

(cherry picked from commit 555d897a15)

Closes scylladb/scylladb#24519
2025-06-17 13:15:01 +03:00
Ferenc Szili
ea822c6c44 tablets: trigger mutation source refresh on tablet count change
Consider the following scenario:

- let's assume tablet 0 has range [1, 5] (pre merge)
- tablet merge happens, tablet 0 has now range [1, 10]
- tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet
  0 still has range [1, 5]
- during a full scan, forward service will intersect the full range with
  tablet ranges and consume one tablet at a time
- replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly
   all sstables intersecting with range [1, 10]

With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to
   range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable,
     EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.

So with the set refreshed, sstable set is aligned with tablet ranges, and no
premature EOS is signalled, otherwise preventing fast forward to from
happening and all data from being properly captured in the read.

This change fixes the bug and triggeres a mutation source refresh whenever
the number of tablets for the table has changed, not only when we have
incoming tablets.

Fixes: #23313
(cherry picked from commit d0329ca370)
2025-06-17 09:23:22 +00:00
Dawid Mędrek
730e16f475 test/cluster/mv: Adjust test to RF-rack-valid keyspaces
We adjust the test in the directory so that all of the used
keyspaces are RF-rack-valid throughout the their execution.

Refs scylladb/scylladb#23428

Closes scylladb/scylladb#23490

(cherry picked from commit 10589e966f)

Closes scylladb/scylladb#24508
2025-06-17 07:33:22 +02:00
Botond Dénes
05476e8aba Merge '[Backport 2025.1] test/cluster/test_read_repair.py: improve trace logging test (again)' from Scylladb[bot]
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair     times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it supposed to test, use tracing to assert that read repair did indeed happen.

Fixes: scylladb/scylladb#23968

Needs backport to 2025.1 and 6.2, both have the flaky test

- (cherry picked from commit 51025de755)

- (cherry picked from commit 29eedaa0e5)

Parent PR: #23989

Closes scylladb/scylladb#24049

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: improve trace logging test (again)
  test/cluster: extract execute_with_tracing() into pylib/util.py
2025-06-16 06:55:13 +03:00
Jenkins Promoter
e7a1183d9c Update pgo profiles - aarch64 2025-06-15 04:29:52 +03:00
Jenkins Promoter
85369cb7d7 Update pgo profiles - x86_64 2025-06-15 04:10:27 +03:00
Botond Dénes
d4e271bca4 Merge '[backport 2025.1] alternator: fix schema "concurrent modification" errors' from Nadav Har'El
In ScyllaDB, schema modification operations use "optimistic locking": A schema operation reads the current schema, decides what it wants to do and prepares changes to the schema, and then attempts to commit those changes - but only if the schema hasn't changed since the first read. If the schema has already been changed by some other node - we need to try again. In a loop.

In Alternator, there are six operations that perform schema modification: CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and UpdateTimeToLive. All of them were missing this loop. We knew about this - and even had FIXME in all places. So all these operations, when facing contention of concurrent schema modifications on different nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the DynamoDB SDK automatically retries operations that fail with retryable errors - like this "Internal server error" - and most likely the schema operation will succeed upon retry. However, as shown in issue #13152 these failures were annoying in our CI, where tests - which disable request retries - failed on these errors.

This patch fixes all six operations (the last three operations all use one common function, db::modify_tags(), so are fixed by one change) to add the missing loop.

The patch also includes reproducing tests for all these operations - the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests we had that only sometimes - very rarely - reproduced the problem. Moreover, the new tests reproduces the bug seperately for each of the six operations, so if we forget to fix one of the six operations, one of the tests would have continued to fail. Of course I checked this during development.

The new tests are in the test/cluster framework, not test/alternator, because this problem can only be reproduced in a multi-node cluster: On a single node, it serializes its schema modifications on its own; The collisions only happen when more than one node attempts schema modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827

(cherry picked from commit 3ce7e250cc)

Closes scylladb/scylladb#24494

* github.com:scylladb/scylladb:
  alternator: fix indentation
  alternator: fix schema "concurrent modification" errors
2025-06-13 14:45:55 +03:00
Szymon Malewski
712d783164 mapreduce_service: Prevent race condition
In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and actual merging is atomic.
However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged.
Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway.

Fixes #20662

Closes scylladb/scylladb#24106

(cherry picked from commit 5969809607)

Closes scylladb/scylladb#24388
2025-06-13 14:45:21 +03:00
Botond Dénes
0a3ae01cbe Merge '[Backport 2025.1] test/boost: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot]
This PR adjusts existing Boost tests so they respect the invariant
introduced by enabling `rf_rack_valid_keyspaces` configuration option.
We disable it explicitly in more problematic tests. After that, we
enable the option by default in the whole test suite.

Fixes scylladb/scylladb#23958

Backport: backporting to 2025.1 to be able to test the implementation there too.

- (cherry picked from commit 6e2fb79152)

- (cherry picked from commit e4e3b9c3a1)

- (cherry picked from commit 1199c68bac)

- (cherry picked from commit cd615c3ef7)

- (cherry picked from commit fa62f68a57)

- (cherry picked from commit 22d6c7e702)

- (cherry picked from commit 237638f4d3)

- (cherry picked from commit c60035cbf6)

Parent PR: #23802

Closes scylladb/scylladb#24367

* github.com:scylladb/scylladb:
  test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
  test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
  test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
2025-06-13 14:44:55 +03:00
Michał Chojnowski
ca865d2311 utils/stream_compressor: allocate memory for zstd compressors externally
The default and recommended way to use zstd compressors is to let
zstd allocate and free memory for compressors on its own.

That's what we did for zstd compressors used in RPC compression.
But it turns out that it generates allocation patterns we dislike.

We expected zstd not to generate allocations after the context object
is initialized, but it turns out that it tries to downsize the context
sometimes (by reallocation). We don't want that because the allocations
generated by zstd are large (1 MiB with the parameters we use),
so repeating them periodically stresses the reclaimer.

We can avoid this by using the "static context" API of zstd,
in which the memory for context is allocated manually by the user
of the library. In this mode, zstd doesn't allocate anything
on its own.

The implementation details of this patch adds a consideration for
forward compatibility: later versions of Scylla can't use a
window size greater than the one we hardcoded in this patch
when talking to the old version of the decompressor.

(This is not a problem, since those compressors are only used
for RPC compression at the moment, where cross-version communication
can be prevented by bumping COMPRESSOR_NAME. But it's something
that the developer who changes the window size must _remember_ to do).

Fixes #24160
Fixes #24183

Closes scylladb/scylladb#24161

(cherry picked from commit 185a032044)

Closes scylladb/scylladb#24279
2025-06-13 14:44:04 +03:00
Nikos Dragazis
e40208f825 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872

(cherry picked from commit eaa2ce1bb5)

Closes scylladb/scylladb#24268
2025-06-13 14:38:54 +03:00
Raphael S. Carvalho
65297d01fd replica: Fix race of some operations like cleanup with snapshot
There are two semaphores in table for synchronizing changes to sstable list:

sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.

sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.

they're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.

problem:

A = tablet cleanup
B = take_snapshot()

1) A acquires sstable_set_mutation_sem for updating list
2) A acquires sstable_deletion_sem, then delete sstable before updating list
3) A releases sstable_deletion_sem, then yield
4) B acquires sstable_deletion_sem
5) B iterates through list and bumps sstable deleted in step 2
6) B fails since it cannot find the file on disk

Initial reaction is to say that no procedure must delete sstable before
updating the list, that's true.

But we want a iteration, running concurrently to cleanup, to not find sstables
being removed from the system. Otherwise, e.g. snapshot works with sstables
of a tablet that was just cleaned up. That's achieved by serializing iteration
with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.

That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.

The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find discard_sstables() was doing deletion without any permit,
meaning another race could happen between truncate and snapshot.

So we're fixing race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible another unknown races are fixed as well.

Fixes #23049.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23117

(cherry picked from commit fedd838b9d)

Closes scylladb/scylladb#23279
2025-06-13 14:35:53 +03:00
Nadav Har'El
56124e9dc5 alternator: fix indentation
The previous patch added some loops around without reindenting the code
in the loop. This patch reindents that code. No functional change.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-06-12 19:39:15 +03:00
Nadav Har'El
6585a056cf alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827

(cherry picked from commit 3ce7e250cc)
2025-06-12 14:20:26 +03:00
Dawid Mędrek
b2372cf578 test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.

(cherry picked from commit c60035cbf6)
2025-06-06 22:07:38 +02:00
Dawid Mędrek
4de1eac4c6 test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the file verify more subtle parts of the behavior
of tablets and rely on topology layouts or using keyspaces that violate
the invariant the `rf_rack_valid_keyspaces` configuration option is
trying to enforce. Because of that, we explicitly disable the option
to be able to enable it by default in the rest of the test suite in
the following commit.

(cherry picked from commit 237638f4d3)
2025-06-06 22:07:31 +02:00
Jenkins Promoter
5030b6612f Update ScyllaDB version to: 2025.1.4 2025-06-05 13:10:05 +03:00
Dawid Mędrek
a20cacdc59 test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
(cherry picked from commit 22d6c7e702)
2025-06-04 13:11:24 +00:00
Dawid Mędrek
8f06dc5779 test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
We make sure that the keyspaces created in the test are always RF-rack-valid.
To achieve that, we change how the test is performed.

Before this commit, we first created a cluster and then ran the actual test
logic multiple times. Each of those test cases created a keyspace with a random
replication factor.

That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify
the property file of a node (see commit: eb5b52f598),
so once we set up the cluster, we cannot adjust its layout to work with another
replication factor.

To solve that issue, we also recreate the cluster in each test case. Now we choose
the replication factor at random, create a cluster distributing nodes across as many
racks as RF, and perform the rest of the logic. We perform it multiple times in
a loop so that the test behaves as before these changes.

(cherry picked from commit fa62f68a57)
2025-06-04 13:11:24 +00:00
Dawid Mędrek
50aa589a66 test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
We distribute the nodes used in the test across two racks so we can
run the test with `rf_rack_valid_keyspaces` set to true.

We want to avoid cross-rack migrations and keep the test as realistic
as possible. Since host3 is supposed to function as a new node in the
cluster, we change the layout of it: now, host1 has 2 shards and resides
in a separate rack. Most of the remaining test logic is preserved and behaves
as before this commit.

There is a slight difference in the tablet migrations. Before the commit,
we were migrating a tablet between nodes of different shard counts. Now
it's impossible because it would force us to migrate tablets between racks.
However, since the test wants to simply verify that an ongoing migration
doesn't interfere with load balancing and still leads to a perfect balance,
that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3,
so to achieve the goal, one more tablet needs to be migrated, and we test
that.

(cherry picked from commit cd615c3ef7)
2025-06-04 13:11:24 +00:00
Dawid Mędrek
a685ad04bd test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
We assign the nodes created by the test to separate racks. It has no impact
on the test since the keyspace used in the test uses RF=2, so the tablet
replicas will still be the same.

(cherry picked from commit 1199c68bac)
2025-06-04 13:11:24 +00:00
Dawid Mędrek
8a20659cce test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
We distribute the nodes used in the test between two racks. Although
that may affect how tablets behave in general, this change will not
have any real impact on the test. The test verifies that load balancing
eventually balances tablets in the cluster, which will still happen.
Because of that, the changes in this commit are safe to apply.

(cherry picked from commit e4e3b9c3a1)
2025-06-04 13:11:24 +00:00
Dawid Mędrek
e0f1eb52fa test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
We distribute the nodes used in the test between two racks. Although that
may have an impact on how tablets behave, it's orthogonal to what the test
verifies -- whether the topology coordinator is continuously in the tablet
migration track. Because of that, it's safe to make this change without
influencing the test.

(cherry picked from commit 6e2fb79152)
2025-06-04 13:11:24 +00:00
Nadav Har'El
7d3972a002 alternator: hide internal tags from users
The "tags" mechanism in Alternator is a convenient way to attach metadata
to Alternator tables. Recently we have started using it more and more for
internal metadata storage:

  * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute
  * CreateTable stores provisioned throughput in tags
    system:provisioned_rcu and system:provisioned_wcu
  * CreateTable stores the table's creation time in a tag called
    system:table_creation_time.

We do not want any of these internal tags to be visible to a
ListTagsOfResource request, because if they are visible (as before this
patch), systems such as Terraform can get confused when they suddenly
see a tag which they didn't set - and may even attempt to delete it
(as reported in issue #24098).

Moreover, we don't want any of these internal tags to be writable
with TagResource or UntagResource: If a user wants to change the TTL
setting they should do it via UpdateTimeToLive - not by writing
directly to tags.

So in this patch we forbid read or write to *any* tag that begins
with the "system:" prefix, except one: "system:write_isolation".
That tag is deliberately intended to be writable by the user, as
a configuration mechanism, and is never created internally by
Scylla. We should have perhaps chosen a different prefix for
configurable vs. internal tags, or chosen more unique prefixes -
but let's not change these historic names now.

This patch also adds regression tests for the internal tags features,
failing before this patch and passing after:
1. internal tags, specifically system:ttl_attribute, are not visible
   in ListTagsOfResource, and cannot be modified by TagResource or
   UntagResource.
2. system:write_isolation is not internal, and be written by either
   TagResource or UntagResource, and read with ListTagsOfResource.

This patch also fixes a bug in the test where we added more checks
for system:write_isolation - test_tag_resource_write_isolation_values.
This test forgot to remove the system:write_isolation tags from
test_table when it ended, which would lead to other tests that run
later to run with a non-default write isolation - something which we
never intended.

Fixes #24098.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24299

(cherry picked from commit 6cbcabd100)

Closes scylladb/scylladb#24376
2025-06-04 09:55:08 +03:00
Michał Chojnowski
4938f5d34a utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.

Also, before the patch `chunked_managed_vector::max_contiguous_allocation`,
repeats the definition of logalloc::max_managed_object_size.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.

Fixes scylladb/scylladb#23854

(cherry picked from commit 7f9152babc)

Closes scylladb/scylladb#24370
2025-06-03 21:11:52 +03:00
Michael Litvak
7249530b8c test_cdc_generation_publishing: fix to read monotonically
The test test_multiple_unpublished_cdc_generations reads the CDC
generation timestamps to verify they are published in the correct order.
To do so it issues reads in a loop with a short sleep period and checks
the differences between consecutive reads, assuming they are monotonic.

However the assumption that the reads are monotonic is not valid,
because the reads are issued with consistency_level=ONE, thus we may read
timestamps {A,B} from some node, then read timestamps {A} from another
node that didn't apply the write of the new timestamp B yet. This will
trigger the assert in the test and fail.

To ensure the reads are monotonic we change the test to use consistency
level ALL for the reads.

Fixes scylladb/scylladb#24262

Closes scylladb/scylladb#24272

(cherry picked from commit 3a1be33143)

Closes scylladb/scylladb#24333
2025-06-03 10:35:20 +03:00
Ran Regev
8663351e99 changed the string literals into the correct ones
Fixes: #23970

use correct string literals:
KMIP_TAG_CRYPTOGRAPHIC_LENGTH_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_LENGTH
KMIP_TAG_CRYPTOGRAPHIC_USAGE_MASK_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_USAGE_MASK

From https://github.com/scylladb/scylladb/issues/23970 description of the
problem (emphasizes are mine):

When transparent data encryption at rest is enabled with KMIP as a key
provider, the observation is that before creating a new key, Scylla tries
to locate an existing key with provided specifications (key algorithm &
length), with the intention to re-use existing key, **but the attributes
sent in the request have minor spelling mistakes** which are rejected by
the KMIP server key provider, and hence scylla assumes that a key with
these specifications doesn't exist, and creates a new key in the KMIP
server. The issue here is that for every new table, ScyllaDB will create
a key in the KMIP server, which could clutter the KMS, and make key
lifecycle management difficult for DBAs.

Closes scylladb/scylladb#24057

(cherry picked from commit 37854acc92)

Closes scylladb/scylladb#24302
2025-06-03 10:34:09 +03:00
Anna Stuchlik
4715692c50 doc: update migration tools overview
This commit updates the migration overview page:

- It removes the info about migration from SSTable to CQL.
- It updates the link to the migrator docs.

Fixes https://github.com/scylladb/scylladb/issues/24247

Refs https://github.com/scylladb/scylladb/pull/21775

Closes scylladb/scylladb#24258

(cherry picked from commit b197d1a617)

Closes scylladb/scylladb#24280
2025-06-03 10:33:23 +03:00
Anna Stuchlik
6a49126196 doc: remove copyright from Cassandra Stress
This commit removes the Apache copyright note from the Cassandra Stress page.

It's a follow up to https://github.com/scylladb/scylladb/pull/21723, which missed
that update (see https://github.com/scylladb/scylladb/pull/21723#discussion_r1944357143).

Cassandra Stress is a separate tool with separate repo with the docs, so the copyright
information on the page is incorrect.

Fixes https://github.com/scylladb/scylladb/issues/23240

Closes scylladb/scylladb#24219

(cherry picked from commit d303edbc39)

Closes scylladb/scylladb#24253
2025-06-03 10:31:27 +03:00
Botond Dénes
8fd50a207b Merge '[Backport 2025.1] Alternator WCU tracking in batch_write_item' from Amnon Heiman
This series adds support for WCU tracking in batch_write_item and tests it.

The patches include:

Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users.

Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.

Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.

The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.

Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.

Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.

Need backport, WCU is partially supported, and is missing from batch_write_item

Fixes https://github.com/scylladb/scylladb/issues/23940

(cherry picked from commit 5ae11746fa)

(cherry picked from commit f2ade71f4f)

(cherry picked from commit 68db77643f)

(cherry picked from commit 14570f1bb5)

(cherry picked from commit 2ab99d7a07)

Parent PR: https://github.com/scylladb/scylladb/pull/23941
Replaces #24028

Closes scylladb/scylladb#24034

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: batch_write validate WCU
  alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
  alternator/executor: add WCU for batch_write_items
  alternator/consumed_capacity: make wcu get_units public
  Alternator: Change the WCU/RCU to use units
2025-06-03 10:30:30 +03:00
Botond Dénes
fabe1082fd Merge '[Backport 2025.1] mv: make base_info in view schemas immutable' from Scylladb[bot]
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as we'll as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). To achieve this, in this series we remove
all base_info members that can change due to a base schema update,
and we calculate the remaining values during view update generation,
using the most up-to-date base schema version.

To calculate the values that depend on the base schema version, we
need to iterate over the view primary key and find the corresponding
columns, which adds extra overhead for each batch of view updates.
However, this overhead should be relatively small, as when creating
a view update, we need to prepare each of its columns anyway. And
if we need to read the old value of the base row, the relative
overhead is even lower.

After this change, the base info in view schemas stays the same
for all base schema updates, so we'll no longer get issues with
base_info being incompatible with a base schema version. Additionally,
it's a step towards making the schema objects immutable, which
we sometimes incorrectly assumed in the past (they're still not
completely immutable yet, as some other fields in view_info other
than base_info are initialized lazily and may depend on the base
schema version).

Fixes https://github.com/scylladb/scylladb/issues/9059
Fixes https://github.com/scylladb/scylladb/issues/21292
Fixes https://github.com/scylladb/scylladb/issues/22194
Fixes https://github.com/scylladb/scylladb/issues/22410

- (cherry picked from commit 900687c818)

- (cherry picked from commit a33963daef)

- (cherry picked from commit a3d2cd6b5e)

- (cherry picked from commit 32258d8f9a)

- (cherry picked from commit 6e539c2b4d)

- (cherry picked from commit 05fce91945)

- (cherry picked from commit ad55935411)

- (cherry picked from commit ea462efa3d)

- (cherry picked from commit d7bd86591e)

- (cherry picked from commit d77f11d436)

- (cherry picked from commit bf7bba9634)

- (cherry picked from commit ee5883770a)

Parent PR: #23337

Closes scylladb/scylladb#23937

* github.com:scylladb/scylladb:
  test: remove flakiness from test_schema_is_recovered_after_dying
  mv: add a test for dropping an index while it's building
  base_info: remove the lw_shared_ptr variant
  view_info: don't re-set base_info after construction
  base_info: remove base_info snapshot semantics
  base_info: remove base schema from the base_info
  schema_registry: store base info instead of base schema for view entries
  base_info: make members non-const
  view_info: move the base info to a separate header
  view_info: move computation of view pk columns not in base pk to view_updates
  view_info: move base-dependent variables into base_info
  view_info: set base info on construction
  alter_table_statement: fix renaming multiple columns in tables with views
2025-06-03 10:29:01 +03:00
Botond Dénes
945c41fe46 mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members
Max purgeable has two possible values for each partition: one for
regular tombstones and one for shadowable ones. Yet currently a single
member is used to cache the max-purgeable value for the partition, so
whichever kind of tombstone is checked first, its max-purgeable will
become sticky and apply to the other kind of tombstones too. E.g. if the
first can_gc() check is for a regular tombstone, its max-purgeable will
apply to shadowable tombstones in the partition too, meaning they might
not be purged, even though they are purgeable, as the shadowable
max-purgeable is expected to be more lenient. The other way around is
worse, as it will result in regular tombstone being incorrectly purged,
permitted by the more lenient shadowable tombstone max-purgeable.
Fix this by caching the two possible values in two separate members.
A reproducer unit test is also added.

Fixes: scylladb/scylladb#23272

Closes scylladb/scylladb#24171

(cherry picked from commit 7db956965e)

Closes scylladb/scylladb#24328
2025-06-03 09:53:21 +03:00
Pavel Emelyanov
a8da6c69d7 test/result_utils: Do not assume map_reduce reducing order
When map_reduce is called on a collection, one shouldn't expect that it
processes the elements of the collection in any specific order.

Current test of map-reduce over boost outcome assumes that if reduce
function is the string concatenation, then it would concatenate the
given vector of strings in the order they are listed. That requirement
should be relaxed, and the result may have reversed concatentation.

Fixes scylladb/scylladb#24321

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24325

(cherry picked from commit a65ffdd0df)

Closes scylladb/scylladb#24334
2025-06-02 14:00:50 +03:00
Yaron Kaikov
d7a02eceea dist/docker/debian/build_docker.sh: add scylla-server-dbg
Adding missing scylla-server-dbg package, to enable debug symbols in our
container

Fixes: https://github.com/scylladb/scylladb/issues/24271

Closes scylladb/scylladb#24285
2025-06-02 13:17:49 +03:00
Botond Dénes
fd0e132587 Merge '[Backport 2025.1] replica: Fix use-after-free with concurrent schema change and sstable set update' from Scylladb[bot]
When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set.

Fixes #22040.

- (cherry picked from commit 628bec4dbd)

- (cherry picked from commit 434c2c4649)

Parent PR: #23680

Closes scylladb/scylladb#24081

* github.com:scylladb/scylladb:
  replica: Fix use-after-free with concurrent schema change and sstable set update
  sstables: Implement sstable_set_impl::all_sstable_runs()
2025-06-02 10:31:04 +03:00
Jenkins Promoter
db0e12aa16 Update pgo profiles - aarch64 2025-06-01 04:21:24 +03:00
Jenkins Promoter
d7f9690d8c Update pgo profiles - x86_64 2025-06-01 04:06:19 +03:00
Anna Stuchlik
4357ee0d0d doc: clarify RF increase issues for tablets vs. vnodes
This commit updates the guidelines for increasing the Replication Factor
depending on whether tablets are enabled or disabled.

To present it in a clear way, I've reorganized the page.

Fixes https://github.com/scylladb/scylladb/issues/23667

Closes scylladb/scylladb#24221

(cherry picked from commit efce03ef43)

Closes scylladb/scylladb#24283
2025-05-30 15:17:11 +03:00
Raphael S. Carvalho
b046c5df93 replica: Fix use-after-free with concurrent schema change and sstable set update
When schema is changed, sstable set is updated according to the compaction
strategy of the new schema (no changes to set are actually made, just
the underlying set type is updated), but the problem is that it happens
without a lock, causing a use-after-free when running concurrently to
another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it
happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed
already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction,
which is triggered after new schema is set. Only strategy state and
backlog tracker are updated immediately, which is fine since strategy
doesn't depend on any particular implementation of sstable set, since
patch "sstables: Implement sstable_set_impl::all_sstable_runs()".

Fixes #22040.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 434c2c4649)
2025-05-29 17:19:43 -03:00
Raphael S. Carvalho
d3831fb67b sstables: Implement sstable_set_impl::all_sstable_runs()
With upcoming change where table::set_compaction_strategy() might delay
update of sstable set, ICS might temporarily work with sstable set
implementations other than partitioned_sstable_set. ICS relies on
all_sstable_runs() during regular compaction, and today it triggers
bad_function_call exception if not overriden by set implementation.
To remove this strong dependency between compaction strategy and
a particular set implementation, let's provide a default implementation
of all_sstable_runs(), such that ICS will still work until the set
is updated eventually through a process that adds or remove a
sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 628bec4dbd)
2025-05-29 17:15:43 -03:00
Piotr Dulikowski
a5f3c0dce9 Merge '[Backport 2025.1] test_mv_tablets_replace: wait for tablet replicas to balance before working on them' from Scylladb[bot]
In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2. Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), intially, the tablets may not be split evenly between all nodes. Because of this, even when we chose a server that hosts the view and a different server that hosts the base table, we sometimes stoped all replicas of the base or the view table because the node with the base table replica may also be a view replica.

After some time, the tablets should be distributed across all nodes. When that happens, there will be no common nodes with a base and view replica, so the test scenario will continue as planned.

In this patch, we add this waiting period after creating the base and view, and continue the test only when all 4 tablets are on distinct nodes.

Fixes scylladb/scylladb#23982
Fixes scylladb/scylladb#23997
Fixes scylladb/scylladb#24250

(cherry picked from commit bceb64fb5a)
(cherry picked from commit 5074daf1b7)

Parent PR: scylladb/scylladb#24111

Closes scylladb/scylladb#24128

* github.com:scylladb/scylladb:
  test: actually wait for tablets to distribute across nodes
  test_mv_tablets_replace: wait for tablet replicas to balance before working on them
2025-05-29 14:26:36 +02:00
Wojciech Mitros
282bd88d14 test: actually wait for tablets to distribute across nodes
In test_tablet_mv_replica_pairing_during_replace, after we create
the tables, we want to wait for their tablets to distribute evenly
across nodes and we have a wait_for for that.
But we don't await this wait_for, so it's a no-op. This patch fixes
it by adding the missing await.

Refs scylladb/scylladb#23982
Refs scylladb/scylladb#23997

Closes scylladb/scylladb#24250

(cherry picked from commit 5074daf1b7)
2025-05-29 12:13:02 +02:00
Wojciech Mitros
47562951c1 test: remove flakiness from test_schema_is_recovered_after_dying
Due to the changes in creating schemas with base info the
test_schema_is_recovered_after_dying seems to be flaky when checking
that the schema is actually lost after 'grace_period'. We don't
actually guarantee that the the schema will be lost at that exact
moment so there's no reason to test this. To remove the flakiness,
we remove the check and the related sleep, which should also slightly
improve the speed of this test.

(cherry picked from commit ee5883770a)
2025-05-27 21:43:00 +02:00
Wojciech Mitros
ec41601929 mv: add a test for dropping an index while it's building
Dropping an index is a schema change of its base table and
a schema drop of the index's materialized view. This combination
of schema changes used to cause issues during view building, because
when a view schema was dropped, it wasn't getting updated with the
new version of the base schema, and while the view building was
in progress, we would update the base schema for the base table
mutation reader and try generating updates with a view schema that
wasn't compatible with the base schema, failing on an `on_internal_error`.

In this patch we add a test for this scenario. We create an index,
halt its view building process using an injection, and drop it.
If no errors are thrown, the test succeeds.

The test was failing before https://github.com/scylladb/scylladb/pull/23337
and is passing afterwards.

(cherry picked from commit bf7bba9634)
2025-05-27 21:42:56 +02:00
Wojciech Mitros
f1fd053572 base_info: remove the lw_shared_ptr variant
The base_dependent_view_info is no longer needed to be shared or
modified in the view_info, so we no longer need to keep it as
a shared pointer.

(cherry picked from commit d77f11d436)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
70b21012cd view_info: don't re-set base_info after construction
In the previous commits we made sure that the base info is not dependent
on the base schema version, and the info dependent on the base schema
version is calculated when it's needed. In this patch we remove the
unnecessary re-setting of the base_info.

The set_base_info method isn't removed completely, because it also has
a secondary function - zeroing the view_info fields other than base_info.
Because of this, in this patch we rename it accordingly and limit its
use to the updates caused by a base schema change.

(cherry picked from commit d7bd86591e)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
4f789bf4c7 base_info: remove base_info snapshot semantics
The base info in view schemas no longer changes on base schema
updates, so saving the base info with a view schema from a specific
point in time doesn't provide any additional benefits.
In this patch we remove the code using the base_and_view snapshots
as it's no longer useful.

(cherry picked from commit ea462efa3d)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
d922a415b5 base_info: remove base schema from the base_info
The base info now only contains values which are not reliant on the
base schema version. We remove the the base schema from the base info
to make it immutable regardless of base schema version, at the point
of this patch it's also not needed anywhere - the new base info can
replace the base schema in most places, and in the few (view_updates)
where we need it, we pull the most recent base schema version from
the database.

After this change, the base info no longer changes in a view schema
after creation, so we'll no longer get errors when we try generating
view updates with a base_info that's incompatible with a specific
base schema version.

Fixes #9059
Fixes #21292
Fixes #22410

(cherry picked from commit ad55935411)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
d9fe006a20 schema_registry: store base info instead of base schema for view entries
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.

This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor withat base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema

(cherry picked from commit 05fce91945)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
4e2f1d0edb base_info: make members non-const
In the following patches we'll add the base info instead of the
base schema to various places (schema building, schema registry).
There, we'll sometimes need to update the base_info fields, which
we can't do with const members. There's also a place (global_schema_ptr)
where we won't be able to use the base_info_ptr (a shared pointer to the
base_info), so we can't just use the base_info_ptr everywhere instead.

In this patch we unmark these members as const.
In the following patches we'll remove the methods for changing the
base_info in the view schema, so it will remain effectively const.

(cherry picked from commit 6e539c2b4d)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
da5ad10002 view_info: move the base info to a separate header
In the following commits the base_depenedent_view_info will be needed
in many more places. To avoid including the whole db/view/view.hh
or forward declaring (where possible) the base info, we move it to
a separate header which can be included anywhere at almost no cost.

(cherry picked from commit 32258d8f9a)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
6e40197c52 view_info: move computation of view pk columns not in base pk to view_updates
In preparation of making the base_info immutable, we want to get rid of
any base_dependent_view_info fields that can change when base schema
is updated.
The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk
base column_ids of corresponding base columns and they can change
(decrease) when an earlier column is dropped in the base table.
view_updates is the only location where these values are used and calculating
them is not expensive when comparing to the overall work done while performing
a view update - we iterate over all view primary key columns and look them up
in the base table.
With this in mind, we can just calculate them when creating a view_updates
object, instead of keeping them in the base_info. We do that in this patch.

(cherry picked from commit a3d2cd6b5e)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
4e0c8b7ab8 view_info: move base-dependent variables into base_info
The has_computed_column_depending_on_base_non_primary_key
and is_partition_key_permutation_of_base_partition_key variables
in the view_info depend on the base table so they should be in the
base_dependent_view_info instead of view_info.

(cherry picked from commit a33963daef)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
7422796845 view_info: set base info on construction
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as well as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). The first step towards that is making
sure that all newly created schemas have the base info set.
We achieve that by requiring a base schema when constructing a view
schema. Unfortunately, this adds complexity each time we're making
a view schema - we need to get the base schema as well.
In most cases, the base schema is already available. The most
problematic scenario is when we create a schema from mutations:
- when parsing system tables we can get the schema from the
database, as regular tables are parsed before views
- when loading a view schema using the schema loader tool, we need
to load the base additionally to the view schema, effectively
doubling the work
- when pulling the schema from another node - in this case we can
only get the current version of the base schema from the local
database

Additionally, we need to consider the base schema version - when
we generate view updates the version of the base schema used for
reads should match the version of the base schema in view's base
info.
This is achieved by selecting the correct (old or new) schema in
`db::schema_tables::merge_tables_and_views` and using the stored
base schema in the schema_registry.

(cherry picked from commit 900687c818)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
6f8fc57908 alter_table_statement: fix renaming multiple columns in tables with views
When we rename columns in a table which has materialized views depending
on it, we need to also rename them in the materialized views' WHERE
clauses.
Currently, we do that by creating a new WHERE clause after each rename,
with the updated column. This is later converted to a mutation that
overwrites the WHERE clause. After multiple renames, we have multiple
mutations, each overwriting the WHERE clause with one column renamed.
As a result, the final WHERE clause is one of the modified clauses with
one column renamed.
Instead, we should prepare one new WHERE clause which includes all the
renamed columns. This patch accomplishes this by processing all the
column renames first, and only preparing the new view schema with the
new WHERE clause afterwards.

This patch also includes a test reproducer for this scenario.

Fixes scylladb/scylladb#22194

Closes scylladb/scylladb#23152

(cherry picked from commit 88d3fc68b5)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
a5e16c69a4 test_mv_tablets_replace: wait for tablet replicas to balance before working on them
In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2.
Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), intially,
the tablets may not be split evenly between all nodes. Because of this, even when we chose a server that
hosts the view and a different server that hosts the base table, we sometimes stoped all replicas of the
base or the view table because the node with the base table replica may also be a view replica.

After some time, the tablets should be distributed across all nodes. When that happens, there will be
no common nodes with a base and view replica, so the test scenario will continue as planned.

In this patch, we add this waiting period after creating the base and view, and continue the test only
when all 4 tablets are on distinct nodes.

Fixes https://github.com/scylladb/scylladb/issues/23982
Fixes https://github.com/scylladb/scylladb/issues/23997

Closes scylladb/scylladb#24111

(cherry picked from commit bceb64fb5a)
2025-05-27 18:34:30 +02:00
Botond Dénes
5a67119dce Merge '[Backport 2025.1] Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy' from Dawid Mędrek
So that a multi-dc/multi-rack cluster can be populated in a single call.

Original PR: scylladb/scylladb#23341

Fixes https://github.com/scylladb/scylladb/issues/23551

(cherry picked from commit 0fdf2a2090)

Closes scylladb/scylladb#23549

* github.com:scylladb/scylladb:
  Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
  test.py: Remove reuse cluster in cluster tests
  test.py apply prepare_3_nodes_cluster in topology
  test.py: introduce prepare_3_nodes_cluster marker
2025-05-27 14:30:07 +03:00
Botond Dénes
70f13f7ff3 test/cluster/test_read_repair.py: improve trace logging test (again)
The test test_read_repair_with_trace_logging wants to test read repair
with trace logging. Turns out that node restart + trace-level logging
+ debug mode is too much and even with 1 minute timeout, the read repair
times out sometimes.
Refactor the test to use injection point instead of restart. To make
sure the test still tests what it supposed to test, use tracing to
assert that read repair did indeed happen.

(cherry picked from commit 29eedaa)
2025-05-26 17:42:32 +03:00
Botond Dénes
dc4e5db1d2 test/cluster: extract execute_with_tracing() into pylib/util.py
To allow reuse in other tests.

(cherry picked from commit 51025de755)
2025-05-26 14:14:31 +03:00
Anna Stuchlik
ee2e6b09fb doc: fix the product name for version 2025.1 (on branch-2025.1)
Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise", but the OS support page still uses that label.
This commit fixes that by replacing "Enterprise" with "ScyllaDB".

This update is required since we've removed "Enterprise" from everywhere else, including the commands, so having it here is confusing.

The same update was added on master and on branch-2025.2: https://github.com/scylladb/scylladb/pull/24204
However, that PR cannot be backported to branch-2025.1 because versions 2025.2 and later
store OS support in a JSON file, whereas 2025.1 and earlier in a regular RST file.
See https://github.com/scylladb/scylladb/issues/24179#issuecomment-2888632722.

Fixes https://github.com/scylladb/scylladb/issues/24179

Closes scylladb/scylladb#24222
2025-05-26 10:28:57 +03:00
Łukasz Paszkowski
a1082de797 tools/scylla-nodetool: fix crash when rows_merged cells contain null
Any empty object of the json::json_list type has its internal
_set variable assigned to false which results in such objects
being skipped by the json::json_builder.

Hence, the json returned by the api GET//compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.

In such cases, executing the command `nodetool compactionhistory`
will result in a crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`

The patch fixes it by checking if the json object contains the
`rows_merged` element before processing. If the element does
not exist, the nodetool will now produce an empty list.

Fixes https://github.com/scylladb/scylladb/issues/23540

Closes scylladb/scylladb#23514

(cherry picked from commit 113647550f)

Closes scylladb/scylladb#24113
2025-05-20 08:30:33 +03:00
Aleksandra Martyniuk
1db71eac3c test_tablet_repair_hosts_filter: change injected error
test_tablet_repair_hosts_filter checks whether the host filter
specfied for tablet repair is correctly persisted. To check this,
we need to ensure that the repair is still ongoing and its data
is kept. The test achieves that by failing the repair on replica
side - as the failed repair is going to be retried.

However, if the filter does not contain any host (included_host_count = 0),
the repair is started on no replica, so the request succeeds
and its data is deleted. The test fails if it checks the filter
after repair request data is removed.

Fail repair on topology coordinator side, so the request is ongoing
regardless of the specified hosts.

Fixes: #23986.

Closes scylladb/scylladb#24003

(cherry picked from commit 2549f5e16b)

Closes scylladb/scylladb#24075
2025-05-19 12:36:45 +03:00
Pavel Emelyanov
3296920fa3 Merge '[Backport 2025.1] logalloc_test: don't test performance in test background_reclaim' from Scylladb[bot]
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test). After the patch,
we only check that the background reclaim happens *eventually*.

Fixes https://github.com/scylladb/scylladb/issues/15677

Backporting this is optional. The test is flaky even in stable branches, but the failure is rare.

- (cherry picked from commit c47f438db3)

- (cherry picked from commit 1c1741cfbc)

Parent PR: #24030

Closes scylladb/scylladb#24092

* github.com:scylladb/scylladb:
  logalloc_test: don't test performance in test `background_reclaim`
  logalloc: make background_reclaimer::free_memory_threshold publicly visible
2025-05-19 12:35:39 +03:00
Aleksandra Martyniuk
b4e2773e63 streaming: use host_id in file streaming
Use host ids instead of ips in file-streaming.

Fixes: #22421.

Closes scylladb/scylladb#24055

(cherry picked from commit 2dcea5a27d)

Closes scylladb/scylladb#24118
2025-05-19 12:33:54 +03:00
Aleksandra Martyniuk
b0a7ca7c17 cql_test_env: main: move stream_manager initialization
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.

Move stream_manager initialization over the storage_service initialization.

Fixes: #23207.

Closes scylladb/scylladb#24008

(cherry picked from commit 9c03255fd2)

Closes scylladb/scylladb#24189
2025-05-19 12:30:01 +03:00
Michał Chojnowski
c339f464b6 logalloc_test: don't test performance in test background_reclaim
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test).

(cherry picked from commit 1c1741cfbc)
2025-05-16 11:49:18 +00:00
Michał Chojnowski
5f4927eaaf logalloc: make background_reclaimer::free_memory_threshold publicly visible
Wanted by the change to the background_reclaim test in the next patch.

(cherry picked from commit c47f438db3)
2025-05-16 11:49:18 +00:00
Wojciech Mitros
cadc3eeae8 mv: remove queue length limit from the view update read concurrency semaphore
Each view update is correlated to a write that generates it (aside from view
building which is throttled separately). These writes are limited by a throttling
mechanism, which effectively works by performing the writes with CL=ALL if
ongoing writes exceed some memory usage limit

When writes generate view updates, they usually also need to perform a read. This read
goes through a read concurrency semaphore where it can get delayed or killed. The
semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue.
If the number of queued reads exceeds a specific limit, the view update will fail on
the replica, causing inconsistencies.

This limit is not necessary. When a read gets queued on the semaphore, the write that's
causing the view update is paused, so the write takes part in the regular write throttling.
If too many writes get stuck on view update reads, they will get throttled, so their
number is limited and the number of queued reads is also limited to the same amount.

In this patch we remove the specified queue length limit for the view update read concurrency
semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write
throttling mechanism. This may allow the queue grow longer than with the previous limit, but
it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the
remaining ones that get queued use a tiny amount of memory, less than the writes that generated
them and which are getting limited directly.

Fixes https://github.com/scylladb/scylladb/issues/23319

Closes scylladb/scylladb#24112

(cherry picked from commit 5920647617)

Closes scylladb/scylladb#24168
2025-05-16 11:46:15 +03:00
Dawid Mędrek
6987a5e363 locator/production_snitch_base: Reduce log level when property file incomplete
We're reducing the log level in case the provided property file is incomplete.
The rationale behind this change is related to how CCM interacts with Scylla:

* The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties`
  configuration every 60 seconds.
* When a new node is added to the cluster, CCM recreates the
  `cassandra-rackdc.properties` file for EVERY node.

If those two processes start happening at about the same time, it may lead
to Scylla trying to read a not-completely-recreated file, and an error will
be produced.

Although we would normally fix this issue and try to avoid the race, that
behavior will be no longer relevant as we're making the rack and DC values
immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem
in the older versions of Scylla could bring a more serious regression. Having
that in mind, this commit is a compromise between making CI less flaky and
having minimal impact when backported.

We do the same for when the format of the file is invalid: the rationale
is the same.

We also do that for when there is a double declaration. Although it seems
impossible that this can stem from the same scenario the other two errors
can (since if the format of the file is valid, the error is justified;
if the format is invalid, it should be detected sooner than a doubled
declaration), let's stay consistent with the logging level.

Fixes scylladb/scylladb#20092

Closes scylladb/scylladb#23956

(cherry picked from commit 9ebd6df43a)

Closes scylladb/scylladb#24142
2025-05-16 11:45:48 +03:00
Anna Stuchlik
6aaf3354bc doc: remove the redundant pages
This commit removes two redundant pages and adds the related redirections.

- The Tutorials page is a duplicate and is not maintained anymore.
  Having it in the docs hurts the SEO of the up-to-date Tutorias page.
- The Contributing page is not helpful. Contributions-related information
  should be maintained in the project README file.

Fixes https://github.com/scylladb/scylladb/issues/17279
Fixes https://github.com/scylladb/scylladb/issues/24060

Closes scylladb/scylladb#24090

(cherry picked from commit eed8373b77)

Closes scylladb/scylladb#24140
2025-05-16 11:45:28 +03:00
Ernest Zaslavsky
4cd6f58111 database_test: Wait for the index to be created
Just call `wait_until_built` for the index in question

fix: https://github.com/scylladb/scylladb/issues/24059

Closes scylladb/scylladb#24117

(cherry picked from commit 4a7c847cba)

Closes scylladb/scylladb#24131
2025-05-16 11:45:04 +03:00
Piotr Smaron
90af439e47 cql: fix CREATE tablets KS warning msg
Materialized Views and Secondary Indexes are yet another features that
keyspaces with tablets do not support, but these were not listed in a
warning message returned to the user on CREATE KEYSPACE statement. This
commit adds the 2 missing features.

Fixes: #24006

Closes scylladb/scylladb#23902

(cherry picked from commit f740f9f0e1)

Closes scylladb/scylladb#24079
2025-05-16 11:44:24 +03:00
Piotr Dulikowski
89d24d813c topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962

(cherry picked from commit 156ff8798b)

Closes scylladb/scylladb#24078
2025-05-16 11:43:50 +03:00
Botond Dénes
a0c01964cd tools/scylla-nodetool: status: handle negative load sizes
Negative load sizes don't make sense, but we've seen a case in
production, where a negative number was returned by ScyllaDB REST API,
so be prepared to handle these too.

Fixes: scylladb/scylladb#24134

Closes scylladb/scylladb#24135

(cherry picked from commit 700a5f86ed)

Closes scylladb/scylladb#24167
2025-05-15 17:39:36 +03:00
Jenkins Promoter
2034578d0e Update pgo profiles - aarch64 2025-05-15 04:22:14 +03:00
Jenkins Promoter
e97c883ef8 Update pgo profiles - x86_64 2025-05-15 04:07:11 +03:00
Tomasz Grabiec
e46e02e3f5 Merge '[Backport 2025.1] test/topology: use standard new_test_keyspace functions' from Scylladb[bot]
This PR improves and refactors the test.topology.util new_test_keyspace generator
and adds a corresponding create_new_test_keyspace function to be used by most if not
all topology unit tests in order to standardize the way the tests create keyspaces
and to mitigate the python driver create keyspace retry issue: https://github.com/scylladb/python-driver/issues/317

Fixes #22342
Fixes #21905
Refs https://github.com/scylladb/scylla-enterprise/issues/5060

Fixes #23699

- (cherry picked from commit 50ce0aaf1c)

- (cherry picked from commit 5d448f721e)

- (cherry picked from commit f946302369)

- (cherry picked from commit 0fd1b846fe)

- (cherry picked from commit a66ddb7c04)

- (cherry picked from commit df84097a4b)

- (cherry picked from commit 59687c25e0)

- (cherry picked from commit fdb339bf28)

- (cherry picked from commit 205ed113dd)

- (cherry picked from commit 57faab9ffa)

- (cherry picked from commit 4fefffe335)

- (cherry picked from commit 480a5837ab)

- (cherry picked from commit fed078a38a)

- (cherry picked from commit c6653e65ba)

- (cherry picked from commit 9c095b622b)

- (cherry picked from commit 0668c642a2)

- (cherry picked from commit 0e11aad9c5)

- (cherry picked from commit ef85c4b27e)

- (cherry picked from commit b13e48b648)

- (cherry picked from commit a82e734110)

- (cherry picked from commit 629ee3cb46)

- (cherry picked from commit 42a104038d)

- (cherry picked from commit d5e3c578f5)

- (cherry picked from commit c05794c156)

- (cherry picked from commit 966cf82dae)

- (cherry picked from commit 11005b10db)

- (cherry picked from commit ff9c8428df)

- (cherry picked from commit 55b35eb21c)

- (cherry picked from commit 5759a97eb4)

- (cherry picked from commit c68d2a471c)

- (cherry picked from commit e05372afa4)

- (cherry picked from commit 380c5e5ac8)

- (cherry picked from commit 3f35491264)

- (cherry picked from commit e72a9d3faa)

- (cherry picked from commit 47326d01b7)

- (cherry picked from commit 72bc4016e7)

- (cherry picked from commit 4fd6c2d24e)

- (cherry picked from commit 50a8f5c1c0)

- (cherry picked from commit 005ceb77d3)

- (cherry picked from commit 649e68c6db)

- (cherry picked from commit 0b88ea9798)

- (cherry picked from commit 6b37d04aa9)

- (cherry picked from commit e59aca66bf)

- (cherry picked from commit 5ff3153912)

- (cherry picked from commit 20f7eda16e)

- (cherry picked from commit f30e4c6917)

- (cherry picked from commit 96d327fb83)

- (cherry picked from commit 16ef78075c)

- (cherry picked from commit 2d4af01281)

- (cherry picked from commit b810791fbb)

- (cherry picked from commit 46b1850f0c)

- (cherry picked from commit 0564e95c51)

- (cherry picked from commit 12f85ce57c)

- (cherry picked from commit 9829b1594f)

- (cherry picked from commit cbe79b20f7)

- (cherry picked from commit cc281ff88d)

Parent PR: #22399

Closes scylladb/scylladb#23408

* github.com:scylladb/scylladb:
  test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
  test/repair: create_table_insert_data_for_repair: create keyspace with unique name
  topology_tasks/test_tablet_tasks: use new_test_keyspace
  topology_tasks/test_node_ops_tasks: use new_test_keyspace
  topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
  topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
  topology_custom/test_view_build_status: use new_test_keyspace
  topology_custom/test_truncate_with_tablets: use new_test_keyspace
  topology_custom/test_topology_failure_recovery: use new_test_keyspace
  topology_custom/test_tablets_removenode: use create_new_test_keyspace
  topology_custom/test_tablets_migration: use new_test_keyspace
  topology_custom/test_tablets_merge: use new_test_keyspace
  topology_custom/test_tablets_intranode: use new_test_keyspace
  topology_custom/test_tablets_cql: use new_test_keyspace
  topology_custom/test_tablets2: use *new_test_keyspace
  topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
  test/cluster/test_tablets.py: Fix test errorneous indentation
  topology_custom/test_tablets: use new_test_keyspace
  topology_custom/test_table_desc_read_barrier: use new_test_keyspace
  topology_custom/test_shutdown_hang: use new_test_keyspace
  topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
  topology_custom/test_rpc_compression: use new_test_keyspace
  topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
  topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace
  topology_custom/test_raft_no_quorum: use new_test_keyspace
  topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace
  topology_custom/test_query_rebounce: use new_test_keyspace
  topology_custom/test_not_enough_token_owners: use new_test_keyspace
  topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace
  topology_custom/test_node_isolation: use create_new_test_keyspace
  topology_custom/test_mv_topology_change: use new_test_keyspace
  topology_custom/test_mv_tablets_replace: use new_test_keyspace
  topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace
  topology_custom/test_mv_tablets: use new_test_keyspace
  topology_custom/test_mv_read_concurrency: use new_test_keyspace
  topology_custom/test_mv_fail_building: use new_test_keyspace
  topology_custom/test_mv_delete_partitions: use new_test_keyspace
  topology_custom/test_mv_building: use new_test_keyspace
  topology_custom/test_mv_backlog: use new_test_keyspace
  topology_custom/test_mv_admission_control: use new_test_keyspace
  topology_custom/test_major_compaction: use new_test_keyspace
  topology_custom/test_maintenance_mode: use new_test_keyspace
  topology_custom/test_lwt_semaphore: use new_test_keyspace
  topology_custom/test_ip_mappings: use new_test_keyspace
  topology_custom/test_hints: use new_test_keyspace
  topology_custom/test_group0_schema_versioning: use new_test_keyspace
  topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace
  topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace
  topology_custom/test_read_repair: use new_test_keyspace
  topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace
  topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace
  topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace
  topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace
  test/topology/util: new_test_keyspace: drop keyspace only on success
  test/topology/util: refactor new_test_keyspace
  test/topology/util: CREATE KEYSPACE IF NOT EXISTS
  test/topology/util: new_test_keyspace: accept ManagerClient
2025-05-13 11:29:44 +02:00
Benny Halevy
ee14a0dac1 test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace
and return the keyspace unique name to the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cc281ff88d)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-05-12 13:58:19 +03:00
Benny Halevy
5eca0b6d16 test/repair: create_table_insert_data_for_repair: create keyspace with unique name
and return it to the caller

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cbe79b20f7)
2025-05-12 13:58:19 +03:00
Benny Halevy
dc1ec6e6d5 topology_tasks/test_tablet_tasks: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9829b1594f)
2025-05-12 13:58:19 +03:00
Benny Halevy
fda3a37026 topology_tasks/test_node_ops_tasks: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12f85ce57c)
2025-05-12 13:58:18 +03:00
Benny Halevy
5b848ecf0a topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0564e95c51)
2025-05-12 13:58:18 +03:00
Benny Halevy
0c73e4522f topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 46b1850f0c)
2025-05-12 13:58:18 +03:00
Benny Halevy
2fec75d557 topology_custom/test_view_build_status: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b810791fbb)
2025-05-12 13:58:18 +03:00
Benny Halevy
80a5c121a3 topology_custom/test_truncate_with_tablets: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2d4af01281)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-05-12 13:58:18 +03:00
Benny Halevy
77aaa3bfbb topology_custom/test_topology_failure_recovery: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 16ef78075c)
2025-05-12 13:58:18 +03:00
Benny Halevy
12191fca90 topology_custom/test_tablets_removenode: use create_new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 96d327fb83)
2025-05-12 13:58:18 +03:00
Benny Halevy
d687488471 topology_custom/test_tablets_migration: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f30e4c6917)
2025-05-12 13:58:18 +03:00
Benny Halevy
2475e333ef topology_custom/test_tablets_merge: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 20f7eda16e)
2025-05-12 13:58:18 +03:00
Benny Halevy
98704d9033 topology_custom/test_tablets_intranode: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5ff3153912)
2025-05-12 13:58:18 +03:00
Benny Halevy
aa6e421b84 topology_custom/test_tablets_cql: use new_test_keyspace
And create_new_test_keyspace when we need drop
to be explicit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e59aca66bf)
2025-05-12 13:58:18 +03:00
Benny Halevy
d01d121b64 topology_custom/test_tablets2: use *new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6b37d04aa9)
2025-05-12 13:58:18 +03:00
Benny Halevy
2949cb2c50 topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0b88ea9798)
2025-05-12 13:58:18 +03:00
Dawid Mędrek
6f65e0469a test/cluster/test_tablets.py: Fix test errorneous indentation
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.

Closes scylladb/scylladb#23659

(cherry picked from commit 0ed21d9cc1)
2025-05-12 13:58:18 +03:00
Benny Halevy
2f2d392d54 topology_custom/test_tablets: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 649e68c6db)
2025-05-12 13:58:18 +03:00
Benny Halevy
871ef9c9b2 topology_custom/test_table_desc_read_barrier: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 005ceb77d3)
2025-05-12 13:58:18 +03:00
Benny Halevy
84ff0eefdc topology_custom/test_shutdown_hang: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 50a8f5c1c0)
2025-05-12 13:58:18 +03:00
Benny Halevy
a866a1ccf3 topology_custom/test_select_from_mutation_fragments: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4fd6c2d24e)
2025-05-12 13:58:18 +03:00
Benny Halevy
b20816ffd2 topology_custom/test_rpc_compression: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 72bc4016e7)
2025-05-12 13:58:18 +03:00
Benny Halevy
c87d12c416 topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 47326d01b7)
2025-05-12 13:58:18 +03:00
Benny Halevy
23052b4df1 topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace
Using the new_test_keyspace fixture is awkward for this test
as it is written to explicitly drop the created keyspaces
at certain points.

Therefore, just use create_new_test_keyspace to standardize the
creation procedure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e72a9d3faa)
2025-05-12 13:58:18 +03:00
Benny Halevy
51cd55fb2e topology_custom/test_raft_no_quorum: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3f35491264)
2025-05-12 13:58:18 +03:00
Benny Halevy
296fbbf1d6 topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 380c5e5ac8)
2025-05-12 13:58:18 +03:00
Benny Halevy
9b602cf6c1 topology_custom/test_query_rebounce: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e05372afa4)
2025-05-12 13:58:18 +03:00
Benny Halevy
4c96ff3e36 topology_custom/test_not_enough_token_owners: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c68d2a471c)
2025-05-12 13:58:18 +03:00
Benny Halevy
0d6383f70d topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5759a97eb4)
2025-05-12 13:58:18 +03:00
Benny Halevy
2c8d6075b8 topology_custom/test_node_isolation: use create_new_test_keyspace
new_test_keyspace is problematic here since
the presence of the banned node can fail the automatic drop of
the test keyspace due to NoHostAvailable (in debug mode for
some reason)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 55b35eb21c)
2025-05-12 13:58:18 +03:00
Benny Halevy
7a8e3f7bb3 topology_custom/test_mv_topology_change: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ff9c8428df)
2025-05-12 13:58:18 +03:00
Benny Halevy
8ece858560 topology_custom/test_mv_tablets_replace: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 11005b10db)
2025-05-12 13:58:18 +03:00
Benny Halevy
b29a01a453 topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 966cf82dae)
2025-05-12 13:58:18 +03:00
Benny Halevy
4b6d5c82ed topology_custom/test_mv_tablets: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c05794c156)
2025-05-12 13:58:18 +03:00
Benny Halevy
33cf81e195 topology_custom/test_mv_read_concurrency: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d5e3c578f5)
2025-05-12 13:58:18 +03:00
Benny Halevy
c2c0c0b4e1 topology_custom/test_mv_fail_building: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 42a104038d)
2025-05-12 13:58:18 +03:00
Benny Halevy
149eab295a topology_custom/test_mv_delete_partitions: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 629ee3cb46)
2025-05-12 13:58:18 +03:00
Benny Halevy
1c0b6aaa47 topology_custom/test_mv_building: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a82e734110)
2025-05-12 13:58:18 +03:00
Benny Halevy
ec888b253a topology_custom/test_mv_backlog: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b13e48b648)
2025-05-12 13:58:18 +03:00
Benny Halevy
5b418eebe7 topology_custom/test_mv_admission_control: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ef85c4b27e)
2025-05-12 13:58:18 +03:00
Benny Halevy
3b4ac3cb2d topology_custom/test_major_compaction: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0e11aad9c5)
2025-05-12 13:58:18 +03:00
Benny Halevy
67c2ca55a6 topology_custom/test_maintenance_mode: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0668c642a2)
2025-05-12 13:58:18 +03:00
Benny Halevy
4e778b4875 topology_custom/test_lwt_semaphore: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9c095b622b)
2025-05-12 13:58:18 +03:00
Benny Halevy
dbf42b8e75 topology_custom/test_ip_mappings: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c6653e65ba)
2025-05-12 13:58:18 +03:00
Benny Halevy
9adb5a4e84 topology_custom/test_hints: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit fed078a38a)
2025-05-12 13:58:18 +03:00
Benny Halevy
ecd337763c topology_custom/test_group0_schema_versioning: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 480a5837ab)
2025-05-12 13:58:18 +03:00
Benny Halevy
9751bd3dbf topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4fefffe335)
2025-05-12 13:58:18 +03:00
Benny Halevy
9179938036 topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 57faab9ffa)
2025-05-12 13:58:17 +03:00
Benny Halevy
a9c8651722 topology_custom/test_read_repair: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 205ed113dd)
2025-05-12 13:58:17 +03:00
Benny Halevy
5a563503b4 topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit fdb339bf28)
2025-05-12 13:58:17 +03:00
Benny Halevy
1d3d48c79e topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 59687c25e0)
2025-05-12 13:58:17 +03:00
Benny Halevy
84adef0489 topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit df84097a4b)
2025-05-12 13:58:17 +03:00
Benny Halevy
bffd3fa3b7 topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a66ddb7c04)
2025-05-12 13:58:17 +03:00
Benny Halevy
ab86683e7a test/topology/util: new_test_keyspace: drop keyspace only on success
When the test fails with exception, keep the keyspace
intact for post-mortem analysis.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0fd1b846fe)
2025-05-12 13:58:17 +03:00
Benny Halevy
09ffcb2d98 test/topology/util: refactor new_test_keyspace
Define create_new_test_keyspace that can be used in
cases we cannot automatically drop the newly created keyspace
due to e.g. loss of raft majority at the end of the test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f946302369)
2025-05-12 13:58:17 +03:00
Benny Halevy
ce0c6e7595 test/topology/util: CREATE KEYSPACE IF NOT EXISTS
Workaround spurious keyspace creation errors
due to retries caused by
https://github.com/scylladb/python-driver/issues/317.
This is safe since the function uses a unique_name for the
keyspace so it should never exist by mistake.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5d448f721e)
2025-05-12 13:58:17 +03:00
Benny Halevy
6d88937dcd test/topology/util: new_test_keyspace: accept ManagerClient
Following patch will convert topology tests to use
new_test_keyspace and friends.

Some tests restart server and reset the driver connection
so we cannot use the original cql Session for
dropping the created keyspace in the `finally` block.

Pass the ManagerClient instead to get a new cql
session for dropping the keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 50ce0aaf1c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-05-12 13:58:16 +03:00
Patryk Jędrzejczak
654c097121 Merge '[Backport 2025.1] Correctly skip updating node's own ip address due to oudated gossiper data ' from Scylladb[bot]
Used host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node.

Fixes: #22777

Backport to 2025.1 since this fixes a crash. Older version do not have the code.

- (cherry picked from commit a2178b7c31)

- (cherry picked from commit ecd14753c0)

- (cherry picked from commit 7403de241c)

Parent PR: #24000

Closes scylladb/scylladb#24088

* https://github.com/scylladb/scylladb:
  test: add reproducer for #22777
  storage_service: use id to check for local node
2025-05-12 09:37:49 +02:00
Gleb Natapov
6a2259a9d1 test: add reproducer for #22777
Add sleep before starting gossiper to increase a chance of getting old
gossiper entry about yourself before updating local gossiper info with
new IP address.

(cherry picked from commit 7403de241c)
2025-05-11 11:38:31 +03:00
Gleb Natapov
64a7a5ed0f storage_service: use id to check for local node
IP may change and an old gossiper message with previous IP may be
processed when it shouldn't.

Fixes: #22777
(cherry picked from commit a2178b7c31)
2025-05-09 12:55:44 +00:00
Michał Chojnowski
94606edff6 test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging
The test checks that merging the partition versions on-the-fly using the
cursor gives the same results as merging them destructively with apply_monotonically.

In particular, it tests that the continuity of both results is equal.
However, there's a subtlety which makes this not true.
The cursor puts empty dummy rows (i.e. dummies shadowed by the partition
tombstone) in the output.
But the destructive merge is allowed (as an expection to the general
rule, for optimization reasons), to remove those dummies and thus reduce
the continuity.

So after this patch we instead check that the output of the cursor
has continuity equal to the merged continuities of version.
(Rather than to the continuity of merged versions, which can be
smaller as described above).

Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did
the same in a different test.
Fixes https://github.com/scylladb/scylladb/issues/13642

Closes scylladb/scylladb#24044

(cherry picked from commit 746ec1d4e4)

Closes scylladb/scylladb#24077
2025-05-09 13:08:59 +02:00
David Garcia
65d57819c9 docs: fix md redirections for multiversion support
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.

The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.

This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.

Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.

Closes scylladb/scylladb#23957

(cherry picked from commit 4ba7182515)

Closes scylladb/scylladb#24010
2025-05-08 11:18:56 +03:00
Botond Dénes
c341668371 Merge '[Backport 2025.1] replica: skip flush of dropped table' from Scylladb[bot]
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

- (cherry picked from commit 91b57e79f3)

- (cherry picked from commit c1618c7de5)

Parent PR: #23876

Closes scylladb/scylladb#23905

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-05-08 11:18:17 +03:00
Aleksandra Martyniuk
880f11fdba streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915

(cherry picked from commit 20c2d6210e)

Closes scylladb/scylladb#24052
2025-05-08 11:17:18 +03:00
Pavel Emelyanov
d19d02b543 sstable_directory: Print ks.cf when moving unshared remove sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912

(cherry picked from commit d40d6801b0)

Closes scylladb/scylladb#24015
2025-05-08 11:14:40 +03:00
Anna Stuchlik
db7de7f434 doc: add a link to the previous Enterprise documentation
This commit adds a link to the docs for previous Enterprise versions
at https://enterprise.docs.scylladb.com/ to the left menu.

As we still support versions 2024.1 and 2024.2, we need to ensure
easier access to those docs sets.

Fixes https://github.com/scylladb/scylladb/issues/23870

Closes scylladb/scylladb#23945

(cherry picked from commit 851a433663)

Closes scylladb/scylladb#24017
2025-05-08 10:58:05 +03:00
Botond Dénes
6201531b39 replica/database: memtable_list: save ref to memtable_table_shared_data
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable accross tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.

A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionaly to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.

Fixes: #23762

Closes scylladb/scylladb#23984

(cherry picked from commit 0a9ca52cfd)

Closes scylladb/scylladb#24033
2025-05-07 19:21:03 +03:00
Amnon Heiman
9e3026b6af alternator/test_metrics.py: batch_write validate WCU
This patch adds a test that verifies the WCU metrics are updated
correctly during a batch_write_item operation.
It ensures that the WCU of each item is calculated independently.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 2ab99d7a07)
2025-05-07 10:48:23 +03:00
Amnon Heiman
c5f0568fb8 alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
This patch adds two tests:
A test that validates WCU calculation for batch put requests on a single table.

A test that validates WCU calculation for batch requests across multiple
tables, including ensuring that delete operations are counted as 1 WCU.

Both tests verify that the consumed capacity is reported correctly
according to the WCU rules.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 14570f1bb5)
2025-05-07 10:47:26 +03:00
Amnon Heiman
b2e8cc5b9b alternator/executor: add WCU for batch_write_items
This patch adds consumed capacity unit support to batch_write_item.

It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.

The WCU metrics are always updated. if the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.

For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 68db77643f)
2025-05-07 10:47:16 +03:00
Amnon Heiman
d7d401df13 alternator/consumed_capacity: make wcu get_units public
This patch adds a public static get_units function to
wcu_consumed_capacity_counter.  It will be used by the batch write item
implementation, which cannot use the wcu_consumed_capacity_counter
directly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

consume_capacity need merge

(cherry picked from commit f2ade71f4f)
2025-05-07 10:38:42 +03:00
Amnon Heiman
f5fafd92b5 Alternator: Change the WCU/RCU to use units
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:

Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicates that it
counts half units.
Change the WCU to count in whole units instead of half units.

Update the tests accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 5ae11746fa)
2025-05-07 10:38:21 +03:00
Piotr Dulikowski
7591e131a2 test: mv: skip test_view_building_scheduling_group in debug
The test populates a table with 50k rows, creates a view on that table
and then compares the time spent in streaming vs. gossip scheduling
groups. It only takes 10s in dev mode on my machine, but is much slower
in debug mode in CI - building the view doesn't finish within 2 minutes.

The bigger the view to build, the more accurrate the measurement;
moreover, the test scenario isn't interesting enough to be worth running
it in debug mode as this should be covered by other tests. Therefore,
just skip this test in debug mode.

Fixes: scylladb/scylladb#23862

Closes scylladb/scylladb#23866

(cherry picked from commit 3d73c79a72)

Closes scylladb/scylladb#23881
2025-05-06 10:16:57 +02:00
Aleksandra Martyniuk
bfdf7c944b test: test table drop during flush
(cherry picked from commit c1618c7de5)
2025-05-06 09:52:42 +02:00
Aleksandra Martyniuk
238cf27471 replica: skip flush of dropped table
(cherry picked from commit 91b57e79f3)
2025-05-06 09:52:30 +02:00
Benny Halevy
fc21a0f8a1 loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock
Rather than lowres_clock, as since
32b7cab917,
loading_cache_for_test uses manual_clock for timing
and relying on lowres_clock to time the test might
run out of memory on fast test machines.

Fixes #23497

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23498

(cherry picked from commit 5f2ce0b022)

Closes scylladb/scylladb#23983
2025-05-01 14:07:25 +03:00
Piotr Dulikowski
b77db75280 utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952

(cherry picked from commit 8ffe4b0308)

Closes scylladb/scylladb#23981
2025-05-01 08:28:46 +03:00
Aleksandra Martyniuk
aaaeb6dcee test_tablet_tasks: use injection to revoke resize
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic and so a test
is flaky.

Use error injection to trigger resize revoke.

Fixes: #22570.

Closes scylladb/scylladb#23966

(cherry picked from commit 1f4edd8683)

Closes scylladb/scylladb#23974
2025-05-01 08:28:09 +03:00
Botond Dénes
f76cf6118a Merge '[Backport 2025.1] topology coordinator: do not proceed further on invalid boostrap tokens' from Scylladb[bot]
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

- (cherry picked from commit 66acaa1bf8)

- (cherry picked from commit 845cedea7f)

- (cherry picked from commit 670a69007e)

Parent PR: #23914

Closes scylladb/scylladb#23949

* github.com:scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid boostrap tokens
  cdc: add sanity check for generating an empty generation
2025-05-01 08:23:06 +03:00
Botond Dénes
b8a8f288fb Merge '[Backport 2025.1] tasks: check whether a node is alive before rpc' from Scylladb[bot]
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

- (cherry picked from commit 53e0f79947)

- (cherry picked from commit e178bd7847)

Parent PR: #23787

Closes scylladb/scylladb#23943

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-05-01 08:21:49 +03:00
Botond Dénes
503748b1ae Merge '[Backport 2025.1] topology_coordinator: stop: await all background_action_holder:s' from Scylladb[bot]
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.

Fixes #23755

* The issue exists since 6.2

- (cherry picked from commit d624795fda)

- (cherry picked from commit 6de79d0dd3)

- (cherry picked from commit 7a0f5e0a54)

Parent PR: #17712

Closes scylladb/scylladb#23799

* github.com:scylladb/scylladb:
  topology_coordinator: stop: await all background_action_holder:s
  topology_coordinator: stop: improve error messages
  topology_coordinator: stop: define stop_background_action helper
2025-05-01 08:20:03 +03:00
Jenkins Promoter
15c88b6453 Update pgo profiles - aarch64 2025-05-01 04:20:37 +03:00
Jenkins Promoter
0bdd7bd755 Update pgo profiles - x86_64 2025-05-01 04:06:29 +03:00
Aleksandra Martyniuk
6d9bf63e06 test: add test for getting tasks children
Add test that checks whether the children of a virtual task will be
properly gathered if a node is down.

(cherry picked from commit e178bd7847)
2025-04-30 10:25:09 +02:00
Aleksandra Martyniuk
9385dc1d4e tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

(cherry picked from commit 53e0f79947)
2025-04-30 10:14:58 +02:00
Patryk Jędrzejczak
bf688890ad Merge '[Backport 2025.1] Ensure raft group0 RPCs use the gossip scheduling group.' from Scylladb[bot]
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

- (cherry picked from commit 60f1053087)

- (cherry picked from commit e05c082002)

Parent PR: #22779

Closes scylladb/scylladb#23844

* https://github.com/scylladb/scylladb:
  Ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-29 16:02:47 +02:00
Piotr Dulikowski
791b4a9fe0 test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.

(cherry picked from commit 670a69007e)
2025-04-28 17:07:43 +00:00
Piotr Dulikowski
d25dd24a7e topology coordinator: do not proceed further on invalid boostrap tokens
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
2025-04-28 17:07:43 +00:00
Piotr Dulikowski
d8380e5d5e cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.

(cherry picked from commit 66acaa1bf8)
2025-04-28 17:07:43 +00:00
Jenkins Promoter
964468d901 Update ScyllaDB version to: 2025.1.3 2025-04-28 16:10:06 +03:00
Tomasz Grabiec
962fa5a369 Merge '[Backport 2025.1] tablets: Equalize per-table balance when allocating tablets for a new table' from Scylladb[bot]
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631

Backport to 2025.1 because load imbalance is a serious problem in production.

- (cherry picked from commit d493a8d736)

- (cherry picked from commit 2597a7e980)

- (cherry picked from commit 1e407ab4d2)

Parent PR: #23708

Closes scylladb/scylladb#23873

* github.com:scylladb/scylladb:
  tablets: Equalize per-table balance when allocating tablets for a new table
  load_sketch: Tolerate missing tablet_map when selecting for a given table
  tests: tablets: Simplify tests by moving common code to topology_builder
2025-04-28 13:23:25 +02:00
Tomasz Grabiec
885924067f Merge '[Backport 2025.1] Simplify loading_cache_test and use manual_clock' from Scylladb[bot]
This series exposes a Clock template parameter for loading_cache so that the test could use
the manual_clock rather than the lowres_clock, since relying on the latter is flaky.

In addition, the test load function is simplified to sleep some small random time and co_return the expected string,
rather than reading it from a real file, since the latter's timing might also be flaky, and it out-of-scope for this test.

Fixes #20322

* The test was flaky forever, so backport is required for all live versions.

- (cherry picked from commit b509644972)

- (cherry picked from commit 934a9d3fd6)

- (cherry picked from commit d68829243f)

- (cherry picked from commit b258f8cc69)

- (cherry picked from commit 0841483d68)

- (cherry picked from commit 32b7cab917)

Parent PR: #22064

Closes scylladb/scylladb#23926

* github.com:scylladb/scylladb:
  tests: loading_cache_test: use manual_clock
  utils: loading_cache: make clock_type a template parameter
  test: loading_cache_test: use function-scope loader
  test: loading_cache_test: simlute loader using sleep
  test: lib: eventually: add sleep function param
  test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
2025-04-28 10:55:14 +02:00
Benny Halevy
bf3b09313c tests: loading_cache_test: use manual_clock
Relying on a real-time clock like lowres_clock
can be flaky (in particular in debug mode).
Use manual_clock instead to harden the test against
timing issues.

Fixes #20322

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 32b7cab917)
2025-04-27 08:49:05 +00:00
Benny Halevy
49bc81415b utils: loading_cache: make clock_type a template parameter
So the unit test can use manual_clock rather than lowres_clock
which can be flaky (in particular in debug mode).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0841483d68)
2025-04-27 08:49:05 +00:00
Benny Halevy
325cdcbd58 test: loading_cache_test: use function-scope loader
Rather than a global function, accessing a thread-local `load_count`.
The thread-local load_count cannot be used when multiple test
cases run in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b258f8cc69)
2025-04-27 08:49:04 +00:00
Benny Halevy
a4584e205c test: loading_cache_test: simlute loader using sleep
This test isn't about reading values from file,
but rather it's about the loading_cache.
Reading from the file can sometimes take longer than
the expected refresh times, causing flakiness (see #20322).

Rather than reading a string from a real file, just
sleep a random, short time, and co_return the string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d68829243f)
2025-04-27 08:49:04 +00:00
Benny Halevy
52c82527f9 test: lib: eventually: add sleep function param
To allow support for manual_clock instead of seastar::sleep.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 934a9d3fd6)
2025-04-27 08:49:04 +00:00
Benny Halevy
3792810030 test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
rather then macros.

This is a first cleanup step before adding a sleep function
parameter to support also manual_clock.

Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL,
respectively, to make an error more visible in the test log
since those entry points print the offending values
when not equal.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b509644972)
2025-04-27 08:49:04 +00:00
Tomasz Grabiec
ba3d53be55 tablets: Equalize per-table balance when allocating tablets for a new table
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631

(cherry picked from commit 1e407ab4d2)
2025-04-25 18:29:36 +02:00
Tomasz Grabiec
e73954da80 load_sketch: Tolerate missing tablet_map when selecting for a given table
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisitng.

(cherry picked from commit 2597a7e980)
2025-04-25 18:29:36 +02:00
Tomasz Grabiec
93dab31007 tests: tablets: Simplify tests by moving common code to topology_builder
Reduces code duplication.

(cherry picked from commit d493a8d736)
2025-04-25 18:29:36 +02:00
Benny Halevy
923944bf21 test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error
The query may fail also on a no_such_keyspace
exception, which generates the following cql error:
```
Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq"
```
Extend the pytest.raises match expression to include
this error as well.

Fixes #23812

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23875

(cherry picked from commit f279625f59)

Closes scylladb/scylladb#23887
2025-04-25 11:34:27 +02:00
Michael Litvak
f71eff37bb test: test_mv_topology_change: increase timeout for remove_node
The test `test_mv_write_to_dead_node` currently uses a timeout of 60
seconds for remove_node, after it was increased from 30 seconds to fix
scylladb/scylladb#22953. Apparently it is still too low, and it was
observed to fail in debug mode.

Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000
seconds, but the test requires a timeout which is shorter than 5
minutes, because it is a regression test for an issue where MV updates
hold topology changes for more than 5 minutes, and we want to verify in
the test that the topology change completes in less than 5 minutes.

To resolve the issue, we set the test to skip in debug mode, because the
remove node operation is unpredictably slow, and we increase the timeout
to 180 seconds which is hopefully enough time for remove_node in
non-debug modes, and still sufficient to satisfy the test requirements.

Fixes scylladb/scylladb#22530

Closes scylladb/scylladb#23833

(cherry picked from commit 5c1d24f983)

Closes scylladb/scylladb#23874
2025-04-24 17:43:16 +02:00
Botond Dénes
e5ddc0c5ce Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
So that a multi-dc/multi-rack cluster can be populated
in a single call.

* Enhancement, no backport required

Closes scylladb/scylladb#23341

* github.com:scylladb/scylladb:
  test/pylib: servers_add: add auto_rack_dc parameter
  test/pylib: servers_add: support list of property_files

(cherry picked from commit 0fdf2a2090)
2025-04-22 19:50:03 +02:00
Andrei Chekun
4ba475eb22 test.py: Remove reuse cluster in cluster tests
Pool is not aware of the cluster configuration, so it can return cluster
to the test that is not suitable for it. Removing reuse will remove such
possibility, so there will be less flaky tests.

Closes scylladb/scylladb#23277

(cherry picked from commit d68e54c26d)
2025-04-22 19:10:27 +02:00
Artsiom Mishuta
b4568850f9 test.py apply prepare_3_nodes_cluster in topology
apply prepare_3_nodes_cluster for all tests in the topology folder
via applying mark at the test module level using pytestmark
https://docs.pytest.org/en/stable/example/markers.html#marking-whole-classes-or-modules

set initial initial_size for topology folder to 0

(cherry picked from commit cf48444e3b)
2025-04-22 19:09:18 +02:00
Artsiom Mishuta
72adae2c16 test.py: introduce prepare_3_nodes_cluster marker
prepare_3_nodes_cluster marker will allow preparing non-dirty 3 nodes cluster
that can be reused between tests

(cherry picked from commit 20777d7fc6)
2025-04-22 19:08:31 +02:00
Sergey Zolotukhin
d34061818c Ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.

(cherry picked from commit e05c082002)
2025-04-22 07:59:01 +00:00
Sergey Zolotukhin
b0fe705c80 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, moving RAFT
verbs to the gossip group in `do_get_rpc_client_idx`,  messaging_service.

Fixes scylladb/scylladb21637

(cherry picked from commit 60f1053087)
2025-04-22 07:59:01 +00:00
Yaron Kaikov
502c62d91d install-dependencies.sh: update node_exporter to 1.9.0
Update node_exporter to 1.9.0 to resolve the following CVE's
https://github.com/advisories/GHSA-49gw-vxvf-fc2g
https://github.com/advisories/GHSA-8xfx-rj4p-23jm
https://github.com/advisories/GHSA-crqm-pwhx-j97f
https://github.com/advisories/GHSA-j7vj-rw65-4v26

Fixes: https://github.com/scylladb/scylladb/issues/22884

regenerate frozen toolchain with optimized clang from
* https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-aarch64.tar.gz
* https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-x86_64.tar.gz

Closes scylladb/scylladb#22987

(cherry picked from commit e6227f9a25)

Closes scylladb/scylladb#23021
2025-04-21 23:12:28 +03:00
Avi Kivity
8ea6f5824a Update seastar submodule
* seastar 6d8fccf14c...ed31c1ce82 (1):
  > Merge 'Share IO queues between mountpoints' from Pavel Emelyanov

Changes in io-queue call for scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.

Closes scylladb/scylladb#23733

Ref 70ac5828a8

Fixes #23820
2025-04-20 13:31:17 +03:00
Benny Halevy
e9b57a149d topology_coordinator: stop: await all background_action_holder:s
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.

Fixes #23755

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7a0f5e0a54)
2025-04-20 07:46:23 +00:00
Benny Halevy
7bdb93aa1c topology_coordinator: stop: improve error messages
"when cleanup" is ill-formed. Use "when XYZ"
to "during XYZ" instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6de79d0dd3)
2025-04-20 07:46:22 +00:00
Benny Halevy
985492018b topology_coordinator: stop: define stop_background_action helper
Refactor the code to use a helper to await background_action_holder
and handle any errors by printing a warning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d624795fda)
2025-04-20 07:46:22 +00:00
Avi Kivity
6a8b033510 Merge '[Backport 2025.1] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot]
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

- (cherry picked from commit 4e2f62143b)

- (cherry picked from commit 6c1889f65c)

Parent PR: #23782

Closes scylladb/scylladb#23810

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-19 18:42:45 +03:00
Botond Dénes
e19ab12f5e Merge '[Backport 2025.1] service/storage_proxy: schedule_repair(): materialize the range into a vector' from Scylladb[bot]
Said method passes down its diff input to mutate_internal(), after some std::ranges massaging. Said massaging is destructive -- it moves items from the diff. If the output range is iterated-over multiple times, only the first time will see the actual output, further iterations will get an empty range.

When trace-level logging is enabled, this is exactly what happens: mutate_internal() iterates over the range multiple times, first to log its content, then to pass it down the stack. This ends up resulting in an empty range being pased down and write handlers being created with nullopt optionals.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

A follow-up stability fix for the test is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23513
Fixes: https://github.com/scylladb/scylladb/issues/23512

Based on code-inspection, all versions are vulnerable, although <=6.2 use boost::ranges, not std::ranges.

- (cherry picked from commit 7150442f6a)

Parent PR: #21910

Closes scylladb/scylladb#23791

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: increase read request timeout
  service/storage_proxy: schedule_repair(): materialize the range into a vector
2025-04-18 14:04:23 +03:00
Anna Stuchlik
f8615b8c53 doc: add info about Scylla Doctor Automation to the docs
Fixes https://github.com/scylladb/scylladb/issues/23642

Closes scylladb/scylladb#23745

(cherry picked from commit 0b4740f3d7)

Closes scylladb/scylladb#23776
2025-04-18 14:03:54 +03:00
Botond Dénes
34ea9af232 Merge '[Backport 2025.1] tablets: rebuild: use repair for tablet rebuild' from Scylladb[bot]
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

- (cherry picked from commit b80e957a40)

- (cherry picked from commit ed7b8bb787)

- (cherry picked from commit 5d6041617b)

- (cherry picked from commit 4a847df55c)

- (cherry picked from commit eb17af6143)

- (cherry picked from commit acd32b24d3)

- (cherry picked from commit 372b562f5e)

Parent PR: #23187

Closes scylladb/scylladb#23682

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-18 14:03:25 +03:00
Michał Chojnowski
e6a2d67be4 managed_bytes_test: add a reproducer for #23781
(cherry picked from commit 6c1889f65c)
2025-04-18 07:55:46 +00:00
Michał Chojnowski
d1d21d97e1 managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

(cherry picked from commit 4e2f62143b)
2025-04-18 07:55:46 +00:00
Botond Dénes
440141a4ff test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.

Fixes: #23513

Closes scylladb/scylladb#23558

(cherry picked from commit 7bbfa5293f)
2025-04-18 06:31:42 +03:00
Nadav Har'El
6b35eea1a9 Merge '[Backport 2025.1] Alternator batch rcu' from Scylladb[bot]
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.

Need backporting to 2025.1, as RCU and WCU are not fully supported

Fixes #23690

- (cherry picked from commit 0eabf8b388)

- (cherry picked from commit 88095919d0)

- (cherry picked from commit 3acde5f904)

Parent PR: #23691

Closes scylladb/scylladb#23790

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: test RCU for batch get item
  alternator/executor: Add RCU support for batch get items
  alternator/consumed_capacity: make functionality public
2025-04-17 21:39:58 +03:00
Botond Dénes
b14ae92f4f service/storage_proxy: schedule_repair(): materialize the range into a vector
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.

Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

Closes scylladb/scylladb#21910

(cherry picked from commit 7150442f6a)
2025-04-17 11:15:19 +00:00
Benny Halevy
fd6c7c53b8 token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.

Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.

Fixes #22715

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22922

(cherry picked from commit 29b795709b)

Closes scylladb/scylladb#22965
2025-04-17 12:59:55 +02:00
Amnon Heiman
9434bd81b3 test_returnconsumedcapacity.py: test RCU for batch get item
This patch adds tests for consumed capacity in batch get item.  It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.

(cherry picked from commit 3acde5f904)
2025-04-17 10:30:18 +00:00
Amnon Heiman
0761eacf68 alternator/executor: Add RCU support for batch get items
This patch adds RCU support for batch get items.  With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table—and so is the read consistency.

(cherry picked from commit 88095919d0)
2025-04-17 10:30:18 +00:00
Amnon Heiman
8bb4ee49da alternator/consumed_capacity: make functionality public
The consumed_capacity_counter is not completely applicable for batch
operations.  This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.

(cherry picked from commit 0eabf8b388)
2025-04-17 10:30:18 +00:00
Avi Kivity
a89cdfc253 scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604

(cherry picked from commit b4d4e48381)

Closes scylladb/scylladb#23749
2025-04-16 14:37:43 +03:00
Botond Dénes
998bfe908f Merge '[Backport 2025.1] Fix EAR not applied on write to S3 (but on read).' from Scylladb[bot]
Fixes #23225
Fixes #23185

Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).

This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.

Unit tests for io objects and a macro test for S3 encrypted storage included.

- (cherry picked from commit 98a6d0f79c)

- (cherry picked from commit e100af5280)

- (cherry picked from commit d46dcbb769)

- (cherry picked from commit e02be77af7)

- (cherry picked from commit 9ac9813c62)

- (cherry picked from commit 5c6337b887)

Parent PR: #23261

Closes scylladb/scylladb#23424

* github.com:scylladb/scylladb:
  encryption: Add "wrap_sink" to encryption sstable extension
  encrypted_file_impl: Add encrypted_data_sink
  sstables::storage: Move wrapping sstable components to storage provider
  sstables::file_io_extension: Add a "wrap_sink" method.
  sstables::file_io_extension: Make sstable argument to "wrap" const
  utils: Add "io-wrappers", useful IO helper types
2025-04-16 09:32:23 +03:00
Calle Wilund
0eed7f8f29 encryption: Add "wrap_sink" to encryption sstable extension
Creates a more efficient data_sink wrapper for encrypted output
stream (S3).

(cherry picked from commit 5c6337b887)
2025-04-15 11:00:22 +00:00
Calle Wilund
f174b419a4 encrypted_file_impl: Add encrypted_data_sink
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would. Including end padding.

For making encrypted data sink writing less cumbersome.

(cherry picked from commit 9ac9813c62)
2025-04-15 11:00:22 +00:00
Calle Wilund
ac4c7a7ad2 sstables::storage: Move wrapping sstable components to storage provider
Fixes #23225
Fixes #23185

Moved wrapping component files/sinks to storage provider. Also ensures
to wrap data_sinks as well as actual files. This ensures that we actually
write encryption if active.

(cherry picked from commit e02be77af7)
2025-04-15 11:00:22 +00:00
Calle Wilund
6feb95ffad sstables::file_io_extension: Add a "wrap_sink" method.
Similar to wrap file, should wrap a data_sink (used for
sstable writers), in obvious write-only, simple stream
mode.

Default impl will detect if we wrap files for this component,
and if so, generate a file wrapper for the input sink, wrap
this, and the wrap it in a file_data_sink_impl.

This is obviously not efficient, so extensions used in actual
non-test code should implement the method.

(cherry picked from commit d46dcbb769)
2025-04-15 11:00:22 +00:00
Calle Wilund
b6ec0961ca sstables::file_io_extension: Make sstable argument to "wrap" const
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.

(cherry picked from commit e100af5280)
2025-04-15 10:36:47 +00:00
Calle Wilund
9a10458500 utils: Add "io-wrappers", useful IO helper types
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.

For testing, and because they fit there, place memory
sink and source types there as well.

(cherry picked from commit 98a6d0f79c)
2025-04-15 10:36:47 +00:00
Pavel Emelyanov
263416201c Merge '[Backport 2025.1] audit: add semaphore to audit_syslog_storage_helper' from Scylladb[bot]
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Moved syslog_send_helper to audit_syslog_storage_helper
 - Corutinize audit_syslog_storage_helper
 - Introduce semaphore with count=1 in audit_syslog_storage_helper.

See https://github.com/scylladb/scylla-dtest/pull/5749 for releated dtest
Fixes: scylladb/scylladb#22973

Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1.

- (cherry picked from commit dbd2acd2be)

- (cherry picked from commit 889fd5bc9f)

- (cherry picked from commit c12f976389)

Parent PR: #23464

Closes scylladb/scylladb#23674

* github.com:scylladb/scylladb:
  audit: add semaphore to audit_syslog_storage_helper
  audit: corutinize audit_syslog_storage_helper
  audit: moved syslog_send_helper to audit_syslog_storage_helper
2025-04-15 12:37:48 +03:00
Jenkins Promoter
42db149393 Update ScyllaDB version to: 2025.1.2 2025-04-15 12:13:36 +03:00
Pavel Emelyanov
4c382fbe7e cql: Remove unused "initial_tablets" mention from guardrails
All tablets configuration was moved into its own "with tablets" section,
this option name cannot be met among replication factors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23555

(cherry picked from commit d4f3a3ee4f)

Closes scylladb/scylladb#23676
2025-04-15 11:01:50 +03:00
David Garcia
7588789b02 fix: openapi not rendering in docs.scylladb.com/manual
Closes scylladb/scylladb#23686

(cherry picked from commit cf11d5eb69)

Closes scylladb/scylladb#23710
2025-04-15 10:58:59 +03:00
Jenkins Promoter
a0faf0bde0 Update pgo profiles - aarch64 2025-04-15 04:33:44 +03:00
Jenkins Promoter
a503e74bf5 Update pgo profiles - x86_64 2025-04-15 04:10:13 +03:00
Aleksandra Martyniuk
6702849f32 test: add test for rebuild with repair
(cherry picked from commit 372b562f5e)
2025-04-14 12:01:58 +02:00
Aleksandra Martyniuk
5c683449b3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.

(cherry picked from commit acd32b24d3)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
2436c24db7 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.

(cherry picked from commit eb17af6143)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
6a251df136 locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.

(cherry picked from commit 4a847df55c)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
0e152e9f2e locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.

(cherry picked from commit 5d6041617b)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
5a42418a19 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.

(cherry picked from commit ed7b8bb787)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
4df0ddb11f gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
(cherry picked from commit b80e957a40)
2025-04-14 11:55:28 +02:00
Botond Dénes
c1dce79847 Merge '[Backport 2025.1] Finalize tablet splits earlier' from Scylladb[bot]
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

This PR fixes the issue by updating the load balancer to no schedule any
migrations and to not make any repair plans when there a resize
finalization is pending in any table.

Also added a testcase to verify the fix.

Fixes #21762

- (cherry picked from commit 8cabc66f07)

- (cherry picked from commit 5b47d84399)

- (cherry picked from commit dccce670c1)

Parent PR: #22148

Closes scylladb/scylladb#23633

* github.com:scylladb/scylladb:
  topology_coordinator: fix indentation in generate_migration_updates
  topology_coordinator: do not schedule migrations when there are pending resize finalizations
  load_balancer: make repair plans only when there is no pending resize finalization
2025-04-14 06:44:57 +03:00
Botond Dénes
251db77fcb mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657

(cherry picked from commit d67202972a)

Closes scylladb/scylladb#23694
2025-04-11 10:53:31 +03:00
Botond Dénes
f7761729cc Merge '[Backport 2025.1] nodetool: cluster repair: add a command to repair tablet keyspaces' from Scylladb[bot]
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.

The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.

The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/22409.

Needs backport to 2025.1 that introduces the new tablet repair API

- (cherry picked from commit cbde835792)

- (cherry picked from commit b81c81c7f4)

- (cherry picked from commit aa3973c850)

- (cherry picked from commit 8bbc5e8923)

- (cherry picked from commit 02fb71da42)

- (cherry picked from commit 9769d7a564)

Parent PR: #22905

Closes scylladb/scylladb#23672

* github.com:scylladb/scylladb:
  docs: nodetool: update repair and add tablet-repair docs
  test: nodetool: add tests for cluster repair command
  nodetool: add cluster repair command
  nodetool: repair: extract getting hosts and dcs to functions
  nodetool: repair: warn about repairing tablet keyspaces
  nodetool: repair: move keyspace_uses_tablets function
2025-04-11 10:53:03 +03:00
Raphael S. Carvalho
75cd8e9492 replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.

Status quo (around the assert in truncate procedure):

1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.

truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).

So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.

Proposed solution:

The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).

But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560

(cherry picked from commit 0f59deffaa)
(cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23649
2025-04-11 10:52:37 +03:00
Raphael S. Carvalho
7007dabdf9 storage_service: Don't retry split when table is dropped
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.

Fixes #21859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22933

(cherry picked from commit 4d8a333a7f)

Closes scylladb/scylladb#23480
2025-04-11 10:52:05 +03:00
Aleksandra Martyniuk
636ec802c3 service: tasks: hold token_metadata_ptr in tablet_virtual_task
Hold token_metadata_ptr in tablet_virtual_task methods that iterate
over tablets, to keep the tablet_map alive.

Fixes: https://github.com/scylladb/scylladb/issues/22316.

Closes scylladb/scylladb#22740

(cherry picked from commit f8e4198e72)

Closes scylladb/scylladb#22937
2025-04-11 10:51:07 +03:00
Avi Kivity
3335557075 Merge '[Backport 2025.1] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot]
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backport to all live release.

- (cherry picked from commit c2518cdf1a)

- (cherry picked from commit 6b5b563ef7)

- (cherry picked from commit 7e600a0747)

- (cherry picked from commit d126ea09ba)

- (cherry picked from commit cb76cafb60)

- (cherry picked from commit df09b3f970)

- (cherry picked from commit e5afd9b5fb)

- (cherry picked from commit 34b18d7ef4)

- (cherry picked from commit f7938e3f8b)

- (cherry picked from commit 6c1f6427b3)

- (cherry picked from commit 0d39091df2)

Parent PR: #23255

Closes scylladb/scylladb#23673

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-10 21:42:28 +03:00
Avi Kivity
6ff7927d67 sstables: store features early in write path
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in a read sstables.

Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).

However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.

Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.

Fixes #23484

Closes scylladb/scylladb#23485

(cherry picked from commit 73e4a3c581)

Closes scylladb/scylladb#23504
2025-04-10 21:41:09 +03:00
Pavel Emelyanov
1021a3d126 Merge '[Backport 2025.1] Allow abort during join_cluster' from Scylladb[bot]
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

* requires backport on top of https://github.com/scylladb/scylladb/pull/23184

- (cherry picked from commit 0fc196991a)

- (cherry picked from commit f269480f53)

- (cherry picked from commit 41f02c521d)

Parent PR: #23306

Closes scylladb/scylladb#23461

* github.com:scylladb/scylladb:
  main: allow abort during join_cluster
  main: add checkpoint before joining cluster
  storage_service: add start_sys_dist_ks
2025-04-10 19:03:46 +03:00
Avi Kivity
5d8bb068fa Merge '[Backport 2025.1] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot]
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live version, as they all contain the bug

- (cherry picked from commit 876cf32e9d)

- (cherry picked from commit faf3aa13db)

- (cherry picked from commit 44748d624d)

- (cherry picked from commit 35bc1fe276)

Parent PR: #22868

Closes scylladb/scylladb#23290

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-04-10 18:22:16 +03:00
Lakshmi Narayanan Sreethar
fb069f0fbf topology_coordinator: fix indentation in generate_migration_updates
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit dccce670c1)
2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar
48077b160d topology_coordinator: do not schedule migrations when there are pending resize finalizations
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.

Also added a testcase to verify the fix.

Fixes #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 5b47d84399)
2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar
c286fc231a load_balancer: make repair plans only when there is no pending resize finalization
Do not make repair plans if any table has pending resize finalization.
This is to ensure that the finalization doesn't get delayed by reapir
tasks.

Refs #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 8cabc66f07)
2025-04-10 18:20:00 +05:30
Botond Dénes
df4872b82a test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.

(cherry picked from commit 0d39091df2)
2025-04-10 06:52:18 -04:00
Botond Dénes
7943db9844 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only active for the table specified in
the "table_name" injection parameter.

(cherry picked from commit 6c1f6427b3)
2025-04-10 06:52:18 -04:00
Botond Dénes
bd8c584a01 utils/error_injection: add a way to set parameters from error injection points
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceedin.

(cherry picked from commit f7938e3f8b)
2025-04-10 06:52:18 -04:00
Botond Dénes
50c05abd14 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.

(cherry picked from commit 34b18d7ef4)
2025-04-10 06:52:18 -04:00
Aleksandra Martyniuk
3a49808707 streaming: fix the way a reason of streaming failure is determined
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
2025-04-10 09:35:56 +02:00
Aleksandra Martyniuk
b57774dea6 streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign a lambda to a variable to prolong the captures lifetime.

(cherry picked from commit 44748d624d)
2025-04-10 09:35:55 +02:00
Aleksandra Martyniuk
67b0ea99a0 streaming: use streaming namespace in table_check.{cc,hh}
(cherry picked from commit faf3aa13db)
2025-04-10 09:35:54 +02:00
Aleksandra Martyniuk
7fa0e041eb repair: streaming: move table_check.{cc,hh} to streaming
(cherry picked from commit 876cf32e9d)
2025-04-10 09:34:23 +02:00
Botond Dénes
de1d8372fa test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).

(cherry picked from commit e5afd9b5fb)
2025-04-10 03:17:27 -04:00
Botond Dénes
dcc3604e02 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.

(cherry picked from commit df09b3f970)
2025-04-10 03:17:27 -04:00
Botond Dénes
39ca3463b3 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.

(cherry picked from commit cb76cafb60)
2025-04-10 03:17:27 -04:00
Botond Dénes
1c7a6ba140 replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still have active read against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.

(cherry picked from commit d126ea09ba)
2025-04-10 03:17:27 -04:00
Botond Dénes
4febf2a938 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.

(cherry picked from commit 7e600a0747)
2025-04-10 03:17:27 -04:00
Botond Dénes
b43d024ffb db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstone which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.

(cherry picked from commit 6b5b563ef7)
2025-04-10 03:17:27 -04:00
Botond Dénes
4bb1969a7f mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compaction
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstone being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.

(cherry picked from commit c2518cdf1a)
2025-04-10 03:17:27 -04:00
Anna Stuchlik
6bcf513f11 doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024
This commit adds the procedure to enable consistent topology updates for upgrades
from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled
after upgrading from 2024.1 to 2024.2).

Fixes https://github.com/scylladb/scylladb/issues/23650

Closes scylladb/scylladb#23651

(cherry picked from commit 93a7b3ac1d)

Closes scylladb/scylladb#23670
2025-04-10 10:09:23 +03:00
Botond Dénes
b1a995b571 Merge '[Backport 2025.1] tablets: Make tablet allocation equalize per-shard load ' from Scylladb[bot]
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Fixes #23378

- (cherry picked from commit d6232a4f5f)

- (cherry picked from commit 6bff596fce)

Parent PR: #23478

Closes scylladb/scylladb#23635

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-10 10:08:38 +03:00
Botond Dénes
ec7da3d785 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.

A reproducer is added to the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395

(cherry picked from commit bd8973a025)

Closes scylladb/scylladb#23476
2025-04-10 10:05:18 +03:00
Botond Dénes
02d89435a9 Merge '[Backport 2025.1] Ignore wrapped exceptions gate_closed_exception and rpc::closed_error when node shuts down.' from Scylladb[bot]
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
Fixes scylladb/scylladb#23305
Fixes scylladb/scylladb#21815

Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.

- (cherry picked from commit 6abfed9817)

- (cherry picked from commit b1e89246d4)

- (cherry picked from commit 0d9d0fe60e)

- (cherry picked from commit d448f3de77)

Parent PR: #23336

Closes scylladb/scylladb#23470

* github.com:scylladb/scylladb:
  database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
  database: Unify exception handling in `do_apply` and `apply_with_commitlog`
  storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
  exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
2025-04-10 10:04:50 +03:00
Kefu Chai
4e500bc806 gms: Fix fmt formatter for gossip_digest_sync
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.

This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.

Fixes #23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23155

(cherry picked from commit 2a9966a20e)

Closes scylladb/scylladb#23199
2025-04-10 10:00:37 +03:00
Botond Dénes
0a86511359 Merge '[Backport 2025.1] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot]
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

- (cherry picked from commit 4d8eb02b8d)

- (cherry picked from commit 7ba29ec46c)

Parent PR: #23044

Closes scylladb/scylladb#23145

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-04-10 09:58:19 +03:00
Asias He
a7ab9149e8 repair: Fix return type for storage_service/tablets/repair API
The API returns the repair task UUID. For example:

{"tablet_task_id":"3597e990-dc4f-11ef-b961-95d5ead302a7"}

Fixes #23032

Closes scylladb/scylladb#23050

(cherry picked from commit 3f59a89e85)

Closes scylladb/scylladb#23090
2025-04-10 09:57:45 +03:00
Piotr Dulikowski
a866dada1d test: test_mv_topology_change: increase timeout for removenode
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.

Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.

Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.

Fixes: scylladb/scylladb#22953

Closes scylladb/scylladb#22958

(cherry picked from commit 43ae3ab703)

Closes scylladb/scylladb#23053
2025-04-10 09:56:38 +03:00
Lakshmi Narayanan Sreethar
4e51f37c76 topology_coordinator: handle_table_migration: do not continue after executing metadata barrier
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.

Fixes #22792

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#22793

(cherry picked from commit 0f7d08d41d)

Closes scylladb/scylladb#23019
2025-04-10 09:55:53 +03:00
Botond Dénes
c44362451c replica/database: setup_scylla_memory_diagnostics_producer() un-static semaphore dump lambda
The lambda which dumps the diagnostics for each semaphore, is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
  static, will be used on all shards.
* The writeln captured on the first dump, will be used on later dumps,
  possibly triggering a segfault.

Drop the `static` to make the lambda local and resolve this problem.

Fixes: scylladb/scylladb#22756

Closes scylladb/scylladb#22776

(cherry picked from commit 820f196a49)

Closes scylladb/scylladb#22938
2025-04-10 09:54:37 +03:00
Calle Wilund
7b351682ac network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693

(cherry picked from commit 342df0b1a8)

Closes scylladb/scylladb#22877
2025-04-10 09:53:48 +03:00
Benny Halevy
eaf67dd227 main: allow abort during join_cluster
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 41f02c521d)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:02:09 +03:00
Benny Halevy
fa92b6787c main: add checkpoint before joining cluster
(cherry picked from commit f269480f53)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:01:21 +03:00
Benny Halevy
a86e7ff286 storage_service: add start_sys_dist_ks
Currently, there's a call to
`supervisor::notify("starting system distributed keyspace")`
which is misleading as it is identical to a similar
message in main() when starting the sharded service.

Change that to a storage_service log messages
and be more specific that the sys_dist_ks shards are started.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0fc196991a)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:01:20 +03:00
Andrzej Jackowski
6da78533ed audit: add semaphore to audit_syslog_storage_helper
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Introduce semaphore with count=1 in audit_syslog_storage_helper.
 - Added 1 hour timeout to the semaphore, so semaphore stalls are
   failed just as all other syslog auditing failures.

Fixes: scylladb#22973
(cherry picked from commit c12f976389)
2025-04-09 14:13:24 +00:00
Andrzej Jackowski
1bb35952d7 audit: corutinize audit_syslog_storage_helper
This change:
 - Corutinize audit_syslog_storage_helper::syslog_send_helper
 - Corutinize audit_syslog_storage_helper::start
 - Corutinize audit_syslog_storage_helper::write

(cherry picked from commit 889fd5bc9f)
2025-04-09 14:13:24 +00:00
Andrzej Jackowski
efb99f29bc audit: moved syslog_send_helper to audit_syslog_storage_helper
This change:
 - Make syslog_send_helper() a method of audit_syslog_storage_helper, so
   syslog_send_helper() can access private members of
   audit_syslog_storage_helper in the next commits.
 - Remove unneeded syslog_send_helper() arguments that now are class
   members.

(cherry picked from commit dbd2acd2be)
2025-04-09 14:13:24 +00:00
Aleksandra Martyniuk
c7f1e1814c docs: nodetool: update repair and add tablet-repair docs
(cherry picked from commit 9769d7a564)
2025-04-09 14:03:29 +00:00
Aleksandra Martyniuk
7bbffb53dd test: nodetool: add tests for cluster repair command
(cherry picked from commit 02fb71da42)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
c5c631f175 nodetool: add cluster repair command
Add a new nodetool cluster repair command that repairs tablet keyspaces.

Users may specify keyspace and tables that they want to repair.
If the keyspace and tables are not specified, all tablet keyspaces
are repaired.

The command calls the new tablet repair API /storage_service/tablets/repair.

(cherry picked from commit 8bbc5e8923)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
8453d4f987 nodetool: repair: extract getting hosts and dcs to functions
(cherry picked from commit aa3973c850)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
a1b8ae57d9 nodetool: repair: warn about repairing tablet keyspaces
Warn about an attempt to repair tablet keysapce with nodetool repair.

A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.

(cherry picked from commit b81c81c7f4)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
b500fa498d nodetool: repair: move keyspace_uses_tablets function
(cherry picked from commit cbde835792)
2025-04-09 14:03:28 +00:00
Nadav Har'El
c94d8e2471 Merge '[Backport 2025.1] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23524

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-09 14:59:13 +03:00
Kefu Chai
d7265a1bc2 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359

(cherry picked from commit ebf9125728)

Closes scylladb/scylladb#23387
2025-04-09 14:56:10 +03:00
Nadav Har'El
7f19a27f4f Merge '[Backport 2025.1] main: safely check stop_signal in-between starting services' from Scylladb[bot]
To simplify aborting scylla while starting the services,
add a _ready state to stop_signal, so that until
main is ready to be stopped by the abort_source,
just register that the signal is caught, and
let a check() method poll that and request abort
and throw respective exception only then, in controlled
points that are in-between starting of services
after the service started successfully and a deferred
stop action was installed.

This patch prevents gate_closed_exception to escape handling
when start-up is aborted early with the stop signal,
causing https://github.com/scylladb/scylladb/issues/23153
The regression is apparently due to a25c3eaa1c

Fixes https://github.com/scylladb/scylladb/issues/23153

* Requires backport to 2025.1 due to a25c3eaa1c

- (cherry picked from commit 23433f593c)

- (cherry picked from commit 282ff344db)

- (cherry picked from commit feef7d3fa1)

- (cherry picked from commit b6705ad48b)

Parent PR: #23103

Closes scylladb/scylladb#23184

* github.com:scylladb/scylladb:
  main: add checkpoints
  main: safely check stop_signal in-between starting services
  main: move prometheus start message
  main: move per-shard database start message
2025-04-09 14:54:19 +03:00
Nadav Har'El
c6825920a6 alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentations says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547

(cherry picked from commit 84fd52315f)

Closes scylladb/scylladb#23643
2025-04-09 12:46:30 +03:00
Botond Dénes
bff75aa812 Merge '[Backport 2025.1] Add tablet enforcing option' from Scylladb[bot]
This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`

Fixes scylladb/scylla-enterprise#4355

[Edit: changed `Refs` above to `Fixes` to apeace the backport bot gods]

* Requires backport to 2025.1

- (cherry picked from commit c62865df90)

- (cherry picked from commit 62aeba759b)

- (cherry picked from commit 9fac0045d1)

Parent PR: #22273

Closes scylladb/scylladb#23602

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-09 08:47:10 +03:00
Michał Chojnowski
2a74426084 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.

Closes scylladb/scylladb#23397

(cherry picked from commit e23fdc0799)

Closes scylladb/scylladb#23628
2025-04-08 19:07:22 +03:00
Lakshmi Narayanan Sreethar
b7e72b3167 replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579

(cherry picked from commit 750f4baf44)

Closes scylladb/scylladb#23645
2025-04-08 18:59:22 +03:00
Yaron Kaikov
98359dbfb1 .github: Make "make-pr-ready-for-review" workflow run in base repo
in 57683c1a50 we fixed the `token` error,
but removed the checkout part which causing now the following error
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage to avoid such error

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641

(cherry picked from commit 2dc7ea366b)

Closes scylladb/scylladb#23654
2025-04-08 13:49:27 +03:00
Benny Halevy
27ca0d1812 boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9fac0045d1)
2025-04-08 08:35:26 +03:00
Benny Halevy
736f89b31a tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 62aeba759b)
2025-04-08 08:35:14 +03:00
Benny Halevy
a49e27ac8f db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c62865df90)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-08 08:08:47 +03:00
Tomasz Grabiec
4f4c884d5d tablets: Make tablet allocation equalize per-shard load
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogenous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

(cherry picked from commit 6bff596fce)
2025-04-07 18:14:11 +02:00
Tomasz Grabiec
55bfbe8ea3 tablets: load_balancer: Fix reporting of total load per node
Load is now utilization, not count, so we should report average
per-shard load, which is equivalent to node's utilization.

(cherry picked from commit d6232a4f5f)
2025-04-07 15:51:08 +00:00
Botond Dénes
1a896169dc Merge '[Backport 2025.1] repair: release erm in repair_writer_impl::create_writer when possible' from Scylladb[bot]
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

- (cherry picked from commit 1dc29ddc86)

- (cherry picked from commit bae6711809)

Parent PR: #23455

Closes scylladb/scylladb#23580

* github.com:scylladb/scylladb:
  \test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-07 10:10:20 +03:00
Kefu Chai
9ccad33e59 .github: Make "make-pr-ready-for-review" workflow run in base repo
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error.  This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
    Error: Input required and not supplied: token
```

This commit addresses the issue by:

- Running the workflow in the base repository instead of the fork. This
  grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
  controlexecution, as recommended in the GitHub Actions documentation
  (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
  This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.

This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.

Fixes scylladb/scylladb#22765

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22766

(cherry picked from commit ca832dc4fb)

Closes scylladb/scylladb#23561
2025-04-07 08:10:10 +03:00
Piotr Smaron
a17dd4d4c9 [Backport 2025.1] auth: forbid modifying system ks by non-superusers
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.

Fixes: scylladb/scylladb#23218
(cherry picked from commit fee50f287c)

Parent PR: #23219

Closes scylladb/scylladb#23594
2025-04-06 15:10:06 +03:00
Nadav Har'El
a2a4c6e4b2 test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.

So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578

(cherry picked from commit a9a6f9eecc)

Closes scylladb/scylladb#23601
2025-04-06 11:49:46 +03:00
Avi Kivity
64182d9df6 Update seastar submodule (prefaulter leaving zombie threads)
* seastar a350b5d70e...6d8fccf14c (1):
  > smp: prefaulter: don't leave zombie worker threads

Fixes #23316
2025-04-05 22:28:53 +03:00
Pavel Emelyanov
8e85ef90d2 sstables_loader: Do not stop sharded<progress_monitor> unconditionally
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.

fixes: #23231

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23483

(cherry picked from commit 832d83ae4b)

Closes scylladb/scylladb#23557
2025-04-04 17:46:20 +03:00
Aleksandra Martyniuk
b5b2ffa5df \test: add test to check concurrent migration and repair of two different tablets
(cherry picked from commit bae6711809)
2025-04-04 10:14:51 +02:00
Andrzej Jackowski
b7f067ce33 audit: fix empty query string in BATCH query
Function modification_statement::add_raw() is never called, which
makes query string in audit_info of batch queries empty. In enterprise
branch, add_raw is called in Cql.g and those changes were never merged
to master.

This changes:
 - Add missing call of add_raw() to Cql.g
 - Include other related changes (from PR#3228 in scylla-enterprise)

Fixes scylladb#23311

Closes scylladb/scylladb#23315

(cherry picked from commit b8adbcbc84)

Closes scylladb/scylladb#23495
2025-04-03 16:46:33 +03:00
Aleksandra Martyniuk
307f00a398 repair: release erm in repair_writer_impl::create_writer when possible
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.

Fixes: #23453.
(cherry picked from commit 1dc29ddc86)
2025-04-03 13:19:40 +00:00
Dawid Mędrek
c56e47f72f db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811

(cherry picked from commit 0a6137218a)

Closes scylladb/scylladb#23370
2025-04-03 09:09:05 +02:00
Tomasz Grabiec
51ee15f02d Merge '[Backport 2025.1] tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Fixes #23042

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config

Closes scylladb/scylladb#23443

* github.com:scylladb/scylladb:
  Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
2025-04-01 20:31:05 +02:00
Vlad Zolotarov
feadb781f2 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.

(cherry picked from commit f7e1695068)
2025-04-01 11:45:54 +00:00
Vlad Zolotarov
6b71d6b9ba transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173

(cherry picked from commit ca6bddef35)
2025-04-01 11:45:54 +00:00
Jenkins Promoter
36bb089663 Update ScyllaDB version to: 2025.1.1 2025-04-01 14:18:18 +03:00
yangpeiyu2_yewu
1661b35050 mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.

Fixes scylladb/scylladb#22790

Closes scylladb/scylladb#22812

(cherry picked from commit 0de232934a)

Closes scylladb/scylladb#22943
2025-04-01 13:46:27 +03:00
Asias He
8c93a331f7 repair: Enable small table optimization for system_replicated_keys
This enterprise-only system table is replicated and small. It should be
included for small table optimization.

Fixes scylladb/scylla-enterprise#5256

Closes scylladb/scylladb#23135

Closes scylladb/scylladb#23147
2025-04-01 13:36:51 +03:00
Calle Wilund
85c161b9f1 generic_server: Update conditions for is_broken_pipe_or_connection_reset
Refs scylla-enterprise#5185
Fixes #22901

If a tls socket gets EPIPE the error is not translated to a specific
gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat
EPIPE as ignorable for plain sockets, we need to unwind nested exception
here to detect that the error was in fact due to this, so we can suppress
log output for this.

Closes scylladb/scylladb#22888

(cherry picked from commit e49f2046e5)

Closes scylladb/scylladb#23045
2025-04-01 13:06:29 +03:00
Patryk Jędrzejczak
d088cc8a2d Merge '[Backport 2025.1] Fix a regression that sometimes causes an internal error and demote barrier_and_drain rpc error log to a warning ' from Scylladb[bot]
The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation.

We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness.

- (cherry picked from commit 1da7d6bf02)

- (cherry picked from commit fe45ea505b)

Parent PR: #22650

Closes scylladb/scylladb#22923

* https://github.com/scylladb/scylladb:
  topology_coordinator: demote barrier_and_drain rpc failure to warning
  topology_coordinator: read peers table only once during topology state application
2025-04-01 11:54:56 +02:00
Patryk Jędrzejczak
39c20144e5 Merge '[Backport 2025.1] raft topology: Add support for raft topology init to happen before group0 initialization' from Scylladb[bot]
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.

This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
anner.

Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also performs
a refactoring to populate all init data to system tables even before
group0 creation(via `raft_initialize_discovery_leader` function). Now
because `raft_initialize_discovery_leader` happens before the group 0
creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.

Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.

By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#21114

- (cherry picked from commit 4748125a48)

- (cherry picked from commit e491950c47)

- (cherry picked from commit 623e01344b)

- (cherry picked from commit d7884cf651)

Parent PR: #22484

Closes scylladb/scylladb#22966

* https://github.com/scylladb/scylladb:
  storage_service: Remove the variable _manage_topology_change_kind_from_group0
  storage_service: fix indentation after the previous commit
  raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
  service/raft: Refactor mutation writing helper functions.
2025-04-01 11:46:15 +02:00
Jenkins Promoter
f1e7cee7a5 Update pgo profiles - aarch64 2025-04-01 04:20:56 +03:00
Jenkins Promoter
023b27312d Update pgo profiles - x86_64 2025-04-01 04:08:00 +03:00
Anna Stuchlik
2ffbc81e19 doc: remove the outdated info on seeds-info
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.

In addition, some clarification has been added to the existing sections.

Fixes https://github.com/scylladb/scylladb/issues/22400

Closes scylladb/scylladb#23282

(cherry picked from commit dbbf9e19e4)

Closes scylladb/scylladb#23327
2025-03-31 12:33:59 +02:00
Yaron Kaikov
88e548ed72 .github: add action to make PR ready for review when conflicts label was removed
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed

Closes scylladb/scylladb#22446

(cherry picked from commit ed4bfad5c3)

Closes scylladb/scylladb#23023
2025-03-31 13:22:04 +03:00
Tomasz Grabiec
975882a489 test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's becase a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451

(cherry picked from commit 8e506c5a8f)

Closes scylladb/scylladb#23469
2025-03-28 14:56:02 +01:00
Sergey Zolotukhin
bfb242b735 database: Pass schema_ptr as const ref in wrap_commitlog_add_error
(cherry picked from commit d448f3de77)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
fe94b5a475 database: Unify exception handling in do_apply and apply_with_commitlog
Move exception wrapping logic from `do_apply` and `apply_with_commitlog`
to `wrap_commitlog_add_error` to ensure consistent error handling.

(cherry picked from commit 0d9d0fe60e)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
f7b7a47404 storage_proxy: Ignore wrapped gate_closed_exception and rpc::closed_error when node shuts down.
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325

(cherry picked from commit b1e89246d4)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
0e0d5241db exceptions: Add try_catch_nested to universally handle nested exceptions of the same type.
(cherry picked from commit 6abfed9817)
2025-03-27 21:28:12 +00:00
Evgeniy Naydanov
3653662099 test.py: random_failures: deselect topology ops for some injections
After recent changes #18640 and #19151 started to reproduce for
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.

The solution is the same: deselect the tests.

Fixes #23302

Closes scylladb/scylladb#23405

(cherry picked from commit 574c81eac6)

Closes scylladb/scylladb#23460
2025-03-27 13:19:59 +02:00
Anna Stuchlik
7336bb38fa doc: fix product names in the 2025.1 upgrage guides
This commit fixes the product names in the upgrade 2025.1 guides so that:

- 6.2 is preceded with "ScyllaDB Open Source"
- 2024.x is preceded with "ScyllaDB Enterprise"
- 2025.1 is preceded with "ScyllaDB"

Fixes https://github.com/scylladb/scylladb/issues/23154

Closes scylladb/scylladb#23223

(cherry picked from commit cd61f60549)

Closes scylladb/scylladb#23328
2025-03-27 11:58:01 +02:00
Avi Kivity
cff90755d8 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats

(cherry picked from commit b1d9f80d85)
2025-03-25 23:16:35 +01:00
Tomasz Grabiec
3be469da29 test: tablets_test: Add support for auto-split mode
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. It's inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default, can be disabled for tests which expect
differently.

shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.

(cherry picked from commit 5e471c6f1b)
2025-03-25 18:23:22 +01:00
Tomasz Grabiec
1895724465 test: cql_test_env: Expose db config
(cherry picked from commit f3b63bfeff)
2025-03-25 18:22:32 +01:00
Jenkins Promoter
9dca28d2b8 Update ScyllaDB version to: 2025.1.0 2025-03-25 09:19:12 +02:00
Avi Kivity
bc98301783 Merge '[Backport 2025.1] repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired as it is provided by topology coordinator.

Use topology_guard to maintain safety.

Fixes: https://github.com/scylladb/scylladb/issues/22408.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23362

* github.com:scylladb/scylladb:
  test: add test to check concurrent tablets migration and repair
  repair: do not hold erm for repair scheduled by scheduler
  repair: get total rf based on current erm
  repair: make shard_repair_task_impl::erm private
  repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
  repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
  repair: pass session_id to repair_writer_impl::create_writer
  repair: keep materialized topology guard in shard_repair_task_impl
  repair: pass session_id to repair_meta
2025-03-23 20:14:53 +02:00
Avi Kivity
220bbcf329 Merge '[Backport 2025.1] cql3: Introduce RF-rack-valid keyspaces' from Scylladb[bot]
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
  keyspaces are RF-rack-valid. If not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

- (cherry picked from commit 32879ec0d5)

- (cherry picked from commit 41f862d7ba)

- (cherry picked from commit 0e04a6f3eb)

Parent PR: #23138

Closes scylladb/scylladb#23398

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-23 16:16:29 +02:00
Dawid Mędrek
ecdefe801c main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.

Fixes scylladb/scylladb#23300

(cherry picked from commit 0e04a6f3eb)
2025-03-21 12:27:04 +00:00
Dawid Mędrek
af2215c2d2 cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.

We provide two tests verifying that the changes work as intended.

Fixes scylladb/scylladb#23276

(cherry picked from commit 41f862d7ba)
2025-03-21 12:27:04 +00:00
Dawid Mędrek
864528eb9b db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356

(cherry picked from commit 32879ec0d5)
2025-03-21 12:27:04 +00:00
Aleksandra Martyniuk
5153b91514 test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.

(cherry picked from commit 20f9d7b6eb)
2025-03-19 10:15:19 +01:00
Aleksandra Martyniuk
0a0347cb4e repair: do not hold erm for repair scheduled by scheduler
Do not hold erm	for tablet repair scheduled by scheduler. Thanks to
that one tablet repair won't exclude migration of other tablets.

Concurrent repair and migration of the same tablet isn't possible,
since a tablet can be in one type of transition only at the time.
Hence the change is safe.

Refs: https://github.com/scylladb/scylladb/issues/22408.
(cherry picked from commit 5b792bdc98)
2025-03-19 10:09:51 +01:00
Aleksandra Martyniuk
da64c02b92 repair: get total rf based on current erm
Get total rf based on erm. Currently, it does not change anything
because erm stays the same during the whole repair.

(cherry picked from commit a1375896df)
2025-03-19 10:09:30 +01:00
Aleksandra Martyniuk
39aabe5191 repair: make shard_repair_task_impl::erm private
Make shard_repair_task_impl::erm private. Access it with getter.

(cherry picked from commit 34cd485553)
2025-03-19 10:09:11 +01:00
Aleksandra Martyniuk
9eeff8573b repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream
does not access erm. Pass small_table_optimization_params containing erm
only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.

(cherry picked from commit 444c7eab90)
2025-03-19 10:08:22 +01:00
Aleksandra Martyniuk
4115f6f367 repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
When small_table_optimization isn't enabled, flush_rows_in_working_row_buf
does not access erm. Add small_table_optimization_params containing erm and
pass it only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.

(cherry picked from commit e56bb5b6e2)
2025-03-19 10:07:45 +01:00
Aleksandra Martyniuk
fb2c46dfbe repair: pass session_id to repair_writer_impl::create_writer
(cherry picked from commit 09c74aa294)
2025-03-19 10:07:00 +01:00
Aleksandra Martyniuk
b4e37600d6 repair: keep materialized topology guard in shard_repair_task_impl
Keep materialized topology guard in shard_repair_task_impl and check
it in check_in_abort_or_shutdown and before each range repair.

(cherry picked from commit 47bb9dcf78)
2025-03-19 10:04:17 +01:00
Aleksandra Martyniuk
6bbf20a440 repair: pass session_id to repair_meta
Pass session_id of tablet repair down the stack from the repair request
to repair_meta.

The session_id will be utiziled in the following patches.

(cherry picked from commit 928f92c780)
2025-03-19 10:02:24 +01:00
Botond Dénes
b8797551eb Merge '[Backport 2025.1] Rack aware tablet merge colocation migration ' from Tomasz Grabiec
service: Introduce rack-aware co-location migrations for tablet merge

Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack-overloaded in case
    RF!=#racks

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations

(cherry picked from commit af949f3b6a)

Also backports dependent patches:
  - locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
  - locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
  - Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec

Closes scylladb/scylladb#22657
Closes scylladb/scylladb#22652

Closes scylladb/scylladb#23297

* github.com:scylladb/scylladb:
  service: Introduce rack-aware co-location migrations for tablet merge
  Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
  locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
  locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
2025-03-18 16:22:29 +02:00
Nadav Har'El
b1cf1890a9 alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows to override this decision
and create an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablet.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature will finally be completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462

(cherry picked from commit c0821842de)

Closes scylladb/scylladb#23298
2025-03-16 18:25:21 +02:00
Jenkins Promoter
2f0ebe9f49 Update pgo profiles - aarch64 2025-03-15 04:21:14 +02:00
Jenkins Promoter
3633fb9ff8 Update pgo profiles - x86_64 2025-03-15 04:13:25 +02:00
Raphael S. Carvalho
33b5f27057 service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack-overloaded in case
    RF!=#racks

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations

(cherry picked from commit af949f3b6a)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-03-14 20:02:33 +01:00
Anna Stuchlik
11ecc886c3 doc: Remove "experimental" from ALTER KEYSPACE with Tablets
Altering a keyspace with tablets is no longer experimental.
This commit removes the "Experimental" label from the feature.

Fixes https://github.com/scylladb/scylladb/issues/23166

Closes scylladb/scylladb#23183

(cherry picked from commit 562b5db5b8)

Closes scylladb/scylladb#23274
2025-03-14 13:57:55 +01:00
Botond Dénes
eb147ec564 Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.

Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.

Tests cannot modify token_metadata only in memory now as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under group0 guard and save back metadata to disk.

Closes scylladb/scylladb#22648

* github.com:scylladb/scylladb:
  test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
  tests: tablets: Set initial tablets to 1 to exit growing mode
  test: tablets_test: Create proper schema in load balancer tests
  test: lib: Introduce topology_builder
  test: cql_test_env: Expose topology_state_machine
  topology_state_machine: Introduce lock transition

(cherry picked from commit 51a273401c)
2025-03-13 14:08:30 +01:00
Tomasz Grabiec
637e5fc9b5 locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
For example, nodes which are being decommissioned should not be
consider as available capacity for new tables. We don't allocate
tablets on such nodes.

Would result in higher per-shard load then planned.

Closes scylladb/scylladb#22657

(cherry picked from commit 3bb19e9ac9)
2025-03-13 14:08:27 +01:00
Tomasz Grabiec
0d77754c63 locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In that case, new_racks will be used, but when we discover no
candidates, we try to pop from existing_racks.

Fixes #22625

Closes scylladb/scylladb#22652

(cherry picked from commit e22e3b21b1)
2025-03-13 14:00:48 +01:00
Benny Halevy
5481c9aedd docs: document the views-with-tablets experimental feature
Refs scylladb/scylladb#22217

Fixes scylladb/scylladb#22893

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22896

(cherry picked from commit 55dbf5493c)

Closes scylladb/scylladb#23024
2025-03-10 13:26:36 +01:00
Botond Dénes
59db708cba Merge '[Backport 2025.1] tablets: repair: fix hosts and dcs filters behavior for tablet repair' from Scylladb[bot]
If hosts and/or dcs filters are specified for tablet repair and
some replicas match these filters, choose the replica that will
be the repair master according to round-robin principle
(currently it's always the first replica).

If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, the repair succeeds and
the repair request is removed (currently an exception is thrown
and tablet repair scheduler reschedules the repair forever).

Fixes: https://github.com/scylladb/scylladb/issues/23100.

Needs backport to 2025.1 that introduces hosts and dcs filters for tablet repair

- (cherry picked from commit 9bce40d917)

- (cherry picked from commit fe4e99d7b3)

- (cherry picked from commit 2b538d228c)

- (cherry picked from commit c40eaa0577)

- (cherry picked from commit c7c6d820d7)

Parent PR: #23101

Closes scylladb/scylladb#23109

* github.com:scylladb/scylladb:
  test: add new cases to tablet_repair tests
  test: extract repiar check to function
  locator: add round-robin selection of filtered replicas
  locator: add tablet_task_info::selected_by_filters
  service: finish repair successfully if no matching replica found
2025-03-10 12:49:01 +02:00
Botond Dénes
28690f8203 Merge '[Backport 2025.1] repair: Introduce Host and DC filter support' from Scylladb[bot]
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.

Fixes https://github.com/scylladb/scylladb/issues/22417

New feature. No backport is needed.

- (cherry picked from commit 4c75701756)

- (cherry picked from commit 5545289bfa)

- (cherry picked from commit 1c8a41e2dd)

- (cherry picked from commit e499f7c971)

Parent PR: #22621

Closes scylladb/scylladb#23080

* github.com:scylladb/scylladb:
  test: add test to check dcs and hosts repair filter
  test: add repair dc selection to test_tablet_metadata_persistence
  repair: Introduce Host and DC filter support
  docs: locator: update the docs and formatter of tablet_task_info
2025-03-10 12:48:49 +02:00
Anna Stuchlik
235c859b98 doc: zero-token nodes and Arbiter DC
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.

The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.

We need separate documents because zero-token nodes may be used
for other purposes in the future.

In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.

Refs https://github.com/scylladb/scylladb/pull/19684

Fixes https://github.com/scylladb/scylladb/issues/20294

Closes scylladb/scylladb#21348

(cherry picked from commit 9ac0aa7bba)

Closes scylladb/scylladb#23201
2025-03-10 10:59:07 +01:00
Benny Halevy
64248f2635 main: add checkpoints
Before starting significant services that didn't
have a corresponding call to supervisor::notify
before them.

Fixes #23153

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b6705ad48b)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-10 08:52:16 +02:00
Benny Halevy
b139b09129 main: safely check stop_signal in-between starting services
To simplify aborting scylla while starting the services,
Add a _ready state to stop_signal, so that until
main is ready to be stopped by the abort_source,
just register that the signal is caught, and
let a check() method poll that and request abort
and throw respective exception only then, in controlled
points that are in-between starting of services
after the service started successfully and a deferred
stop action was installed.

Refs #23153

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit feef7d3fa1)
2025-03-10 08:42:30 +02:00
Benny Halevy
ca161900cd main: move prometheus start message
The `prometheus_server` is started only conditionally
but the notification message is sent and logged
unconditionally.
Move it inside the condtional code block.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 282ff344db)
2025-03-10 08:16:37 +02:00
Benny Halevy
5417d4d517 main: move per-shard database start message
It is now logged out of place, so move it to right before
calling `start` on every database shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 23433f593c)
2025-03-10 08:16:37 +02:00
Anna Stuchlik
5453e85f39 doc: remove the reference to the 6.2 version
This commit removes the OSS version name, which is irrelevant
and confusing for 2025.1 and later users.
Also, it updates the warning to avoid specifying the release
when the deprecated feature will be removed.

Fixes https://github.com/scylladb/scylladb/issues/22839

Closes scylladb/scylladb#22936

(cherry picked from commit d0a48c5661)

Closes scylladb/scylladb#23022
2025-03-07 12:53:42 +02:00
Anna Stuchlik
7a6bcb3a3f doc: remove references to Enterprise
This commit removes the redundant references to Enterprise,
which are no longer valid.

Fixes https://github.com/scylladb/scylladb/issues/22927

Closes scylladb/scylladb#22930

(cherry picked from commit a28bbc22bd)

Closes scylladb/scylladb#22963
2025-03-07 12:53:22 +02:00
Anna Stuchlik
8b2a382eb6 doc: add support for Ubuntu 24.04 in 2024.1
Fixes https://github.com/scylladb/scylladb/issues/22841

Refs https://github.com/scylladb/scylla-enterprise/issues/4550

Closes scylladb/scylladb#22843

(cherry picked from commit 439463dbbf)

Closes scylladb/scylladb#23092
2025-03-07 12:51:13 +02:00
Dusan Malusev
cdd51d8b7a docs: add instruction for installing cassandra-stress
Signed-off-by: Dusan Malusev <dusan.malusev@scylladb.com>

Closes scylladb/scylladb#21723

(cherry picked from commit 4e6ea232d2)

Closes scylladb/scylladb#22947
2025-03-07 11:48:46 +02:00
Anna Stuchlik
88a8d140b3 doc: add information about tablets limitation to the CQL page
This commit adds a link to the Limitations section on the Tablets page
to the CQL pag, the tablets option.
This is actually the place where the user will need the information:
when creating a keyspace.

In addition, I've reorganized the section for better readability
(otherwise, the section about limitations was easy to miss)
and moved the section up on the page.

Note that I've removed the updated content from the  `_common` folder
(which I deleted) to the .rst page - we no longer split OSS and Enterprise,
so there's no need to keep using the `scylladb_include_flag` directive
to include OSS- and Ent-specific content.

Fixes https://github.com/scylladb/scylladb/issues/22892

Fixes https://github.com/scylladb/scylladb/issues/22940

Closes scylladb/scylladb#22939

(cherry picked from commit 0999fad279)

Closes scylladb/scylladb#23091
2025-03-07 11:48:07 +02:00
Aleksandra Martyniuk
1957dac2b4 test: add new cases to tablet_repair tests
Add tests for tablet repair with host and dc filters that select
one or no replica.

(cherry picked from commit c7c6d820d7)
2025-03-05 10:59:00 +01:00
Aleksandra Martyniuk
1091ef89e1 test: extract repiar check to function
(cherry picked from commit c40eaa0577)
2025-03-05 10:59:00 +01:00
Aleksandra Martyniuk
b081e07ffa locator: add round-robin selection of filtered replicas
(cherry picked from commit 2b538d228c)
2025-03-05 10:58:59 +01:00
Aleksandra Martyniuk
1f102ca2f7 locator: add tablet_task_info::selected_by_filters
Extract dcs and hosts filters check to a method.

(cherry picked from commit fe4e99d7b3)
2025-03-05 10:36:51 +01:00
Aleksandra Martyniuk
8a98f0d5b6 service: finish repair successfully if no matching replica found
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, an exception is thrown. The repair
fails and tablet repair scheduler reschedules it forever.

Such a repair should actually succeed (as all specified relpicas were
repaired) and the repair request should be removed.

Treat the repair as successful if the filters were specified and
selected no replica.

(cherry picked from commit 9bce40d917)
2025-03-05 10:36:50 +01:00
Botond Dénes
01485b2158 reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if permit timed out).
If the permit also happens to have wait for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.

Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
2025-03-04 18:46:55 +00:00
Botond Dénes
953c7cd71a test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.

(cherry picked from commit 4d8eb02b8d)
2025-03-04 18:46:55 +00:00
Anna Stuchlik
cdae92065b doc: add the 2025.1 upgrade guides and reorganize the upgrade section
This commit adds the upgrade guides relevant in version 2025.1:
- From 6.2 to 2025.1
- From 2024.x to 2025.1

It also removes the upgrade guides that are not relevant in 2025.1 source available:
- Open Source upgrade guides
- From Open Source to Enterprise upgrade guides
- Links to the Enterprise upgrade guides

Also, as part of this PR, the remaining relevant content has been moved to
the new About Upgrade page.

WHAT NEEDS TO BE REVIEWED
- Review the instructions in the 6.2-to-2025.1 guide
- Review the instructions in the 2024.x-to-2025.1 guide
- Verify that there are no references to Open Source and Enterprise.

The scope of this PR does not have to include metrics - the info can be added
in a follow-up PR.

Fixes https://github.com/scylladb/scylladb/issues/22208
Fixes https://github.com/scylladb/scylladb/issues/22209
Fixes https://github.com/scylladb/scylladb/issues/23072
Fixes https://github.com/scylladb/scylladb/issues/22346

Closes scylladb/scylladb#22352

(cherry picked from commit 850aec58e0)

Closes scylladb/scylladb#23106
2025-03-04 08:15:08 +02:00
Jenkins Promoter
4813c48d64 Update pgo profiles - aarch64 2025-03-01 04:23:19 +02:00
Jenkins Promoter
b623b108c3 Update pgo profiles - x86_64 2025-03-01 04:05:24 +02:00
Aleksandra Martyniuk
7fdc7bdc4b test: add test to check dcs and hosts repair filter
(cherry picked from commit e499f7c971)
2025-02-27 12:14:47 +01:00
Aleksandra Martyniuk
c2e926850d test: add repair dc selection to test_tablet_metadata_persistence
(cherry picked from commit 1c8a41e2dd)
2025-02-27 12:14:47 +01:00
Asias He
6d5b029812 repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417

(cherry picked from commit 5545289bfa)
2025-02-27 12:14:44 +01:00
Aleksandra Martyniuk
ffeb55cf77 docs: locator: update the docs and formatter of tablet_task_info
(cherry picked from commit 4c75701756)
2025-02-26 23:49:50 +00:00
Jenkins Promoter
37aa7c216c Update ScyllaDB version to: 2025.1.0-rc4 2025-02-25 21:33:18 +02:00
Gleb Natapov
0b0e9f0c32 treewide: include build_mode.hh for SCYLLA_BUILD_MODE_RELEASE where it is missing
Fixes: #22914

Closes scylladb/scylladb#22915

(cherry picked from commit 914c9f1711)

Closes scylladb/scylladb#22962
2025-02-25 18:12:54 +03:00
Evgeniy Naydanov
871fabd60a test.py: test_random_failures: improve handling of hung node
In some cases the paused/unpaused node can hang not after 30s timeout.
This make the test flaky.  Change the condition to always check the
coordinator's log if there is a hung node.

Add `stop_after_streaming` to the list of error injections which can
cause a node's hang.

Also add a wait for a new coordinator election in cluster events
which cause such elections.

Closes scylladb/scylladb#22825

(cherry picked from commit 99be9ac8d8)

Closes scylladb/scylladb#23007
2025-02-25 14:31:51 +03:00
Abhi
67b7ea12a2 storage_service: Remove the variable _manage_topology_change_kind_from_group0
This commit removes the variable _manage_topology_change_kind_from_group0
which was used earlier as a work around for correctly handling
topology_change_kind variable, it was brittle and had some bugs. Earlier commits
made some modifications to deal with handling topology_change_kind variable
post _manage_topology_change_kind_from_group0 removal

(cherry picked from commit d7884cf651)
2025-02-20 21:21:31 +00:00
Abhi
d74bb95f54 storage_service: fix indentation after the previous commit
(cherry picked from commit 623e01344b)
2025-02-20 21:21:31 +00:00
Abhinav Jha
98977e9465 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
In the current scenario, topology_change_kind variable, was been handled using
 _manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)

Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.

Fixes: scylladb/scylladb#21114
(cherry picked from commit e491950c47)
2025-02-20 21:21:31 +00:00
Abhi
e84376c9dc service/raft: Refactor mutation writing helper functions.
We use these changes in following commit.

(cherry picked from commit 4748125a48)
2025-02-20 21:21:31 +00:00
Gleb Natapov
79556be7a7 topology_coordinator: demote barrier_and_drain rpc failure to warning
The failure may happen during normal operation as well (for instance if
leader changes).

Fixes: scylladb/scylladb#22364
(cherry picked from commit fe45ea505b)
2025-02-19 08:59:53 +00:00
Gleb Natapov
fe0740ff56 topology_coordinator: read peers table only once during topology state application
During topology state application peers table may be updated with the
new ip->id mapping. The update is not atomic: it adds new mapping and
then removes the old one. If we call get_host_id_to_ip_map while this is
happening it may trigger an internal error there. This is a regression
since ef929c5def. Before that commit the
code read the peers table only once before starting the update loop.
This patch restores the behaviour.

Fixes: scylladb/scylladb#22578
(cherry picked from commit 1da7d6bf02)
2025-02-19 08:59:53 +00:00
Pavel Emelyanov
aa5cb15166 Merge 'Alternator: implement UpdateTable operation to add or delete GSI' from Nadav Har'El
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explained, this required changing how Alternator stores materialized view keys - instead of insisting that these key must be real columns (that is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567.

We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes #11471.

This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.

Closes scylladb/scylladb#21989

* github.com:scylladb/scylladb:
  alternator: fix view build on oversized GSI key attribute
  mv: clean up do_delete_old_entry
  test/alternator: unflake test for IndexStatus
  test/alternator: work around unrelated bug causing test flakiness
  docs/alternator: adding a GSI is no longer an unimplemented feature
  test/alternator: remove xfail from all tests for issue 11567
  alternator: overhaul implementation of GSIs and support UpdateTable
  mv: support regular_column_transformation key columns in view
  alternator: add new materialized-view computed column for item in map
  build: in cmake build, schema needs alternator
  build: build tests with Alternator
  alternator: add function serialized_value_if_type()
  mv: introduce regular_column_transformation, a new type of computed column
  alternator: add IndexStatus/Backfilling in DescribeTable
  alternator: add "LimitExceededException" error type
  docs/alternator: document two more unimplemented Alternator features

(cherry picked from commit 529ff3efa5)

Closes scylladb/scylladb#22826
2025-02-18 19:05:21 +02:00
Jenkins Promoter
13d79ba990 Update ScyllaDB version to: 2025.1.0-rc3 2025-02-18 15:06:57 +02:00
Nadav Har'El
35b410326b test/topology_custom: fix very slow test test_localnodes_broadcast_rpc_address
The test
topology_custom/test_alternator::test_localnodes_broadcast_rpc_address
sets up nodes with a silly "broadcast rpc address" and checks that
Alternator's "/localnodes" requests returns it correctly.

The problem is that although we don't use CQL in this test, the test
framework does open a CQL connection when the test starts, and closes
it when it ends. It turns out that when we set a silly "broadcast RPC
address", the driver tends to try to connect to it when shutting down,
I'm not even sure why. But the choice of the silly address was 1.2.3.4
is unfortunate, because this IP address is actually routable - and
the driver hangs until it times out (in practice, in a bit over two
minutes). This trivial patch changes 1.2.3.4 to 127.0.0.0 - and equally
silly address but one to which connections fail immediately.

Before this patch, the test often takes more than 2 minutes to finish
on my laptop, after this patch, it always finishes in 4-5 seconds.

Fixes #22744

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22746

(cherry picked from commit f89235517d)

Closes scylladb/scylladb#22875
2025-02-18 10:33:21 +02:00
Botond Dénes
12a3fcceae Merge '[Backport 2025.1] sstable_loader: fix cross-shard resource cleanup in download_task_impl ' from Scylladb[bot]
This PR addresses two related issues in our task system:

1. Prepares for asynchronous resource cleanup by converting release_resources() to a coroutine. This refactoring enables future improvements in how we handle resource cleanup.

2. Fixes a cross-shard resource cleanup issue in the SSTable loader where destruction of per-shard progress elements could trigger "shared_ptr accessed on non-owner cpu" errors in multi-shard environments. The fix uses coroutines to ensure resources are released on their owner shards.

Fixes #22759

---

this change addresses a regression introduced by d815d7013c, which is contained by 2025.1 and master branches. so it should be backported to 2025.1 branch.

- (cherry picked from commit 4c1f1baab4)

- (cherry picked from commit b448fea260)

Parent PR: #22791

Closes scylladb/scylladb#22871

* github.com:scylladb/scylladb:
  sstable_loader: fix cross-shard resource cleanup in download_task_impl
  tasks: make release_resources() a coroutine
2025-02-18 10:32:48 +02:00
Gleb Natapov
040c59674a api: initialize token metadata API after starting the gossiper
Token metadata API now depend on gossiper to do ip to host id mappings,
so initialized it after the gossiper is initialized and de-initialized
it before gossiper is stopped.

Fixes: scylladb/scylladb#22743

Closes scylladb/scylladb#22760

(cherry picked from commit d288d79d78)

Closes scylladb/scylladb#22854
2025-02-18 10:32:24 +02:00
Asias He
b50a6657e8 repair: Add await_completion option for tablet_repair api
Set true to wait for the repair to complete. Set false to skip waiting
for the repair to complete. When the option is not provided, it defaults
to false.

It is useful for management tool that wants the api to be async.

Fixes #22418

Closes scylladb/scylladb#22436

(cherry picked from commit fb318d0c81)

Closes scylladb/scylladb#22851
2025-02-18 10:31:53 +02:00
Botond Dénes
93479ffcf9 Merge '[Backport 2025.1] raft/group0_state_machine: load current RPC compression dict on startup' from Michał Chojnowski
We are supposed to be loading the most recent RPC compression dictionary on startup, but we forgot to port the relevant piece of logic during the source-available port. This causes a restarted node not to use the dictionary for RPC compression until the next dictionary update.

Fix that.

Fixes #22738

This is more of a bugfix than an improvement, so it should be backported to 2025.1.

* (cherry picked from commit [dd82b40](dd82b40186))

* (cherry picked from commit [8fb2ea6](8fb2ea61ba))

Additionally cherry picked https://github.com/scylladb/scylladb/pull/22836 to fix the timeout.

Parent PR: #22739

Closes scylladb/scylladb#22837

* github.com:scylladb/scylladb:
  test_rpc_compression.py: fix an overly-short timeout
  test_rpc_compression.py: test the dictionaries are loaded on startup
  raft/group0_state_machine: load current RPC compression dict on startup
2025-02-18 10:31:23 +02:00
Botond Dénes
38bd74b2d4 tools/scylla-nodetool: netstats: don't assume both senders and receivers
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.

Fixes: #22770

Closes scylladb/scylladb#22771

(cherry picked from commit 87e8e00de6)

Closes scylladb/scylladb#22874
2025-02-17 14:34:36 +02:00
Takuya ASADA
6ee1779578 dist: fix upgrade error from 2024.1
We need to allow replacing nodetool from scylla-enterprise-tools < 2024.2,
just like we did for scylla-tools < 5.5.
This is required to make packages able to upgrade from 2024.1.

Fixes #22820

Closes scylladb/scylladb#22821

(cherry picked from commit b5e306047f)

Closes scylladb/scylladb#22867
2025-02-16 14:47:48 +02:00
Kefu Chai
9fe2301647 sstable_loader: fix cross-shard resource cleanup in download_task_impl
Previously, download_task_impl's destructor would destroy per-shard progress
elements on whatever shard the task was destroyed on. In multi-shard
environments, this caused "shared_ptr accessed on non-owner cpu" errors when
attempting to free memory allocated on a different shard.

Fix by:
- Convert progress_per_shard into a sharded service
- Stop the service on owner shards during cleanup using coroutines
- Add operator+= to stream_progress to leverage seastar's built-in adder
  instead of a custom adder struct

Alternative approaches considered:

1. Using foreign_ptr: Rejected as it would require interface changes
   that complicate stream delegation. foreign_ptr manages the underlying
   pointee with another smart pointer but does not expose the smart
   pointer instance in its APIs, making it impossible to use
   shared_ptr<stream_progress> in the interface.
2. Using vector<stream_progress>: Rejected for similar interface
   compatibility reasons.

This solution maintains the existing interfaces while ensuring proper
cross-shard cleanup.

Fixes scylladb/scylladb#22759
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit b448fea260)
2025-02-15 22:46:43 +00:00
Kefu Chai
6b27459de3 tasks: make release_resources() a coroutine
Convert tasks::task_manager::task::impl::release_resources() to a coroutine
to prepare for upcoming changes that will implement asynchronous resource
release.

This is a preparatory refactoring that enables future coroutine-based
implementation of resource cleanup logic.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 4c1f1baab4)
2025-02-15 22:46:43 +00:00
Jenkins Promoter
48130ca2e9 Update pgo profiles - aarch64 2025-02-15 04:20:15 +02:00
Jenkins Promoter
5054087f0b Update pgo profiles - x86_64 2025-02-15 04:05:06 +02:00
Botond Dénes
889fb9c18b Update tools/java submodule
* tools/java 807e991d...6dfe728a (1):
  > dist: support smooth upgrade from enterprise to source availalbe

Fixes: scylladb/scylladb#22820
2025-02-14 11:14:07 +02:00
Botond Dénes
c627aff5f7 Merge '[Backport 2025.1] reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Scylladb[bot]
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature eviction.

Fixes: https://github.com/scylladb/scylladb/issues/22629

Backport required to all active releases, they are all affected.

- (cherry picked from commit a3ae0c7cee)

- (cherry picked from commit 9174f27cc8)

Parent PR: #22701

Closes scylladb/scylladb#22752

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: set_notify_handler(): disable timeout
  reader_permit: mark check_abort() as const
2025-02-13 15:24:54 +02:00
Michał Chojnowski
ffca4a9f85 test_rpc_compression.py: fix an overly-short timeout
The timeout of 10 seconds is too small for CI.
I didn't mean to make it so short, it was an accident.

Fix that by changing the timeout to 10 minutes.
2025-02-13 10:03:13 +01:00
Michał Chojnowski
2c0ffdce31 pgo: disable tablets for training with secondary index, lwt and counters
As of right now, materialized views (and consequently secondary
indexes), lwt and counters are unsupported or experimental with tablets.
Since by defaults tablets are enabled, training cases using those
features are currently broken.

The right thing to do here is to disable tablets in those cases.

Fixes https://github.com/scylladb/scylladb/issues/22638

Closes scylladb/scylladb#22661

(cherry picked from commit bea434f417)

Closes scylladb/scylladb#22808
2025-02-13 09:42:09 +02:00
Botond Dénes
ff7e93ddd5 db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.

Fixes: #22450

Closes scylladb/scylladb#22679

(cherry picked from commit 3d12451d1f)

Closes scylladb/scylladb#22722
2025-02-13 09:40:25 +02:00
Botond Dénes
1998733228 service: query_pager: fix last-position for filtering queries
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.

Fixes: #22620

Closes scylladb/scylladb#22622

(cherry picked from commit 7ce932ce01)

Closes scylladb/scylladb#22719
2025-02-13 09:40:05 +02:00
Botond Dénes
e79ee2ddb0 reader_concurrency_semaphore: foreach_permit(): include _inactive_reads
So inactive reads show up in semaphore diagnostics dumps (currently the
only non-test user of this method).

Fixes: #22574

Closes scylladb/scylladb#22575

(cherry picked from commit e1b1a2068a)

Closes scylladb/scylladb#22611
2025-02-13 09:39:39 +02:00
Aleksandra Martyniuk
4c39943b3f replica: mark registry entry as synch after the table is added
When a replica get a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.

Mark schema as synced after the table is added to database::_tables_metadata.

Fixes: #22347.

Closes scylladb/scylladb#22348

(cherry picked from commit 328818a50f)

Closes scylladb/scylladb#22604
2025-02-13 09:39:13 +02:00
Calle Wilund
17c86f8b57 encryption: Fix encrypted components mask check in describe
Fixes #22401

In the fix for scylladb/scylla-enterprise#892, the extraction and check for sstable component encryption mask was copied
to a subroutine for description purposes, but a very important 1 << <value> shift was somehow
left on the floor.

Without this, the check for whether we actually contain a component encrypted can be wholly
broken for some components.

Closes scylladb/scylladb#22398

(cherry picked from commit 7db14420b7)

Closes scylladb/scylladb#22599
2025-02-13 09:38:41 +02:00
Botond Dénes
d05b3897a2 Merge '[Backport 2025.1] api: task_manager: do not unregister finish task when its status is queried' from Scylladb[bot]
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.

Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.

Fixes: https://github.com/scylladb/scylladb/issues/21388.

It's a fix to task_manager API, should be backported to all branches

- (cherry picked from commit e37d1bcb98)

- (cherry picked from commit 18cc79176a)

Parent PR: #22310

Closes scylladb/scylladb#22598

* github.com:scylladb/scylladb:
  api: task_manager: do not unregister tasks on get_status
  api: task_manager: add /task_manager/drain
2025-02-13 09:38:12 +02:00
Botond Dénes
9116fc635e Merge '[Backport 2025.1] split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Scylladb[bot]
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.

The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE

Fixes #22431

This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1

- (cherry picked from commit 24e8d2a55c)

- (cherry picked from commit 8bff7786a8)

Parent PR: #22330

Closes scylladb/scylladb#22560

* github.com:scylladb/scylladb:
  test: add reproducer and test for fix to split ready CG creation
  table: run set_split_mode() on all storage groups during all_storage_groups_split()
2025-02-13 09:36:23 +02:00
Raphael S. Carvalho
5f74b5fdff test: Use linux-aio backend again on seastar-based tests
Since mid December, tests started failing with ENOMEM while
submitting I/O requests.

Logs of failed tests show IO uring was used as backend, but we
never deliberately switched to IO uring. Investigation pointed
to it happening accidentaly in commit 1bac6b75dc,
which turned on IO uring for allowing native tool in production,
and picked linux-aio backend explicitly when initializing Scylla.
But it missed that seastar-based tests would pick the default
backend, which is io_uring once enabled.

There's a reason we never made io_uring the default, which is
that it's not stable enough, and turns out we made the right
choice back then and it apparently continue to be unstable
causing flakiness in the tests.

Let's undo that accidental change in tests by explicitly
picking the linux-aio backend for seastar-based tests.
This should hopefully bring back stability.

Refs #21968.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22695

(cherry picked from commit ce65164315)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22800
2025-02-12 20:50:51 +02:00
Michał Chojnowski
a746fd2bb8 test_rpc_compression.py: test the dictionaries are loaded on startup
Reproduces scylladb/scylladb#22738

(cherry picked from commit 8fb2ea61ba)
2025-02-11 15:52:34 +00:00
Michał Chojnowski
89a5889bed raft/group0_state_machine: load current RPC compression dict on startup
We are supposed to be loading the most recent RPC compression dictionary
on startup, but we forgot to port the relevant piece of logic during
the source-available port.

(cherry picked from commit dd82b40186)
2025-02-11 15:52:33 +00:00
Michael Litvak
8d1f6df818 test/test_view_build_status: fix flaky asserts
In few test cases of test_view_build_status we create a view, wait for
it and then query the view_build_status table and expect it to have all
rows for each node and view.

But it may fail because it could happen that the wait_for_view query and
the following queries are done on different nodes, and some of the nodes
didn't apply all the table updates yet, so they have missing rows.

To fix it, we change the assert to work in the eventual consistency
sense, retrying until the number of rows is as expectd.

Fixes scylladb/scylladb#22644

Closes scylladb/scylladb#22654

(cherry picked from commit c098e9a327)

Closes scylladb/scylladb#22780
2025-02-11 10:21:54 +01:00
Avi Kivity
75320c9a13 Update tools/cqlsh submodule (driver update, upgradability)
* tools/cqlsh 52c6130...02ec7c5 (18):
  > chore(deps): update dependency scylla-driver to v3.28.2
  > dist: support smooth upgrade from enterprise to source availalbe
  > github action: fix downloading of artifacts
  > chore(deps): update docker/setup-buildx-action action to v3
  > chore(deps): update docker/login-action action to v3
  > chore(deps): update docker/build-push-action action to v6
  > chore(deps): update docker/setup-qemu-action action to v3
  > chore(deps): update peter-evans/dockerhub-description action to v4
  > upload actions: update the usage for multiple artifacts
  > chore(deps): update actions/download-artifact action to v4.1.8
  > chore(deps): update dependency scylla-driver to v3.28.0
  > chore(deps): update pypa/cibuildwheel action to v2.22.0
  > chore(deps): update actions/checkout action to v4
  > chore(deps): update python docker tag to v3.13
  > chore(deps): update actions/upload-artifact action to v4
  > github actions: update it to work
  > add option to output driver debug
  > Add renovate.json (#107)

Fixes: https://github.com/scylladb/scylladb/issues/22420
2025-02-09 18:07:55 +02:00
Yaron Kaikov
359af0ae9c dist: support smooth upgrade from enterprise to source availalbe
When upgrading for example from `2024.1` to `2025.1` the package name is
not identical casuing the upgrade command to fail:
```
Command: 'sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade scylla -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"'
Exit code: 100
Stdout:
Selecting previously unselected package scylla.
Preparing to unpack .../6-scylla_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb ...
Unpacking scylla (2025.1.0~dev-0.20250118.1ef2d9d07692-1) ...
Errors were encountered while processing:
/tmp/apt-dpkg-install-JbOMav/0-scylla-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/1-scylla-python3_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/2-scylla-server_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/3-scylla-kernel-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/4-scylla-node-exporter_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/5-scylla-cqlsh_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
Stderr:
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

Adding `Obsoletes` (for rpm) and `Replaces` (for deb)

Fixes: https://github.com/scylladb/scylladb/issues/22420

Closes scylladb/scylladb#22457

(cherry picked from commit 93f53f4eb8)

Closes scylladb/scylladb#22753
2025-02-09 18:06:52 +02:00
Avi Kivity
7f350558c2 Update tools/python3 (smooth upgrade from enterprise)
* tools/python3 8415caf...91c9531 (1):
  > dist: support smooth upgrade from enterprise to source availalbe

Ref #22420
2025-02-09 14:22:33 +02:00
Botond Dénes
fa9b1800b6 reader_concurrency_semaphore: set_notify_handler(): disable timeout
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. This latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature
eviction.

Fixes: #scylladb/scylladb#22629
(cherry picked from commit 9174f27cc8)
2025-02-09 00:32:38 +00:00
Botond Dénes
c25d447b9c reader_permit: mark check_abort() as const
All it does is read one field, making it const makes using it easier.

(cherry picked from commit a3ae0c7cee)
2025-02-09 00:32:38 +00:00
Ferenc Szili
cf147d8f85 truncate: create session during request handling
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session

This change moves the creation of the session ID for truncate from the request
creation to the request handling.

Fixes #22613

Closes scylladb/scylladb#22615

(cherry picked from commit a59618e83d)

Closes scylladb/scylladb#22705
2025-02-06 10:09:00 +02:00
Botond Dénes
319626e941 reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.

Fixes: #22588

Closes scylladb/scylladb#22624

(cherry picked from commit f2d5819645)

Closes scylladb/scylladb#22704
2025-02-06 10:08:19 +02:00
Aleksandra Martyniuk
cca2d974b6 service: use read barrier in tablet_virtual_task::contains
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.

To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.

Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.

Fixes: #21975.

Closes scylladb/scylladb#21995

(cherry picked from commit 610a761ca2)

Closes scylladb/scylladb#22694
2025-02-06 10:07:51 +02:00
Aleksandra Martyniuk
43f2e5f86b nodetool: tasks: print empty string for start_time/end_time if unspecified
If start_time/end_time is unspecified for a task, task_manager API
returns epoch. Nodetool prints the value in task status.

Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.

Modify nodetool tasks status docs to show empty end_time.

Fixes: #22373.

Closes scylladb/scylladb#22370

(cherry picked from commit 477ad98b72)

Closes scylladb/scylladb#22601
2025-02-06 10:05:07 +02:00
Takuya ASADA
ad81d49923 dist: Support FIPS mode
- To make Scylla able to run in FIPS-compliant system, add .hmac files for
  crypto libraries on relocatable/rpm/deb packages.
- Currently we just write hmac value on *.hmac files, but there is new
  .hmac file format something like this:

  ```
  [global]
  format-version = 1
  [lib.xxx.so.yy]
  path = /lib64/libxxx.so.yy
  hmac = <hmac>
  ```
  Seems like GnuTLS rejects fips selftest on .libgnutls.so.30.hmac when
  file format is older one.
  Since we need to absolute path on "path" directive, we need to generate
  .libgnutls.so.30.hmac in older format on create-relocatable-script.py,

Fixes scylladb/scylladb#22573

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes scylladb/scylladb#22384

(cherry picked from commit fb4c7dc3d8)

Closes scylladb/scylladb#22587
2025-02-06 10:01:12 +02:00
Wojciech Mitros
138c68d80e mv: forbid views with tablets by default
Materialized views with tablets are not stable yet, but we want
them available as an experimental feature, mainly for teseting.

The feature was added in scylladb/scylladb#21833,
but currently it has no effect. All tests have been updated to use the
feature, so we should finally make it work.
This patch prevents users from creating materialized views in keyspaces
using tablets when the VIEWS_WITH_TABLETS feature is not enabled - such
requests will now get rejected.

Fixes scylladb/scylladb#21832

Closes scylladb/scylladb#22217

(cherry picked from commit 677f9962cf)

Closes scylladb/scylladb#22659
2025-02-04 08:06:23 +01:00
Avi Kivity
e0fb727f18 Update seastar submodule (hwloc failure on some AWS instances)
* seastar 1822136684...a350b5d70e (1):
  > resource: fallback to sysconf when failed to detect memory size from hwloc

Fixes #22382.
2025-02-03 22:47:39 +02:00
Jenkins Promoter
440833ae59 Update ScyllaDB version to: 2025.1.0-rc2 2025-02-03 13:23:18 +02:00
Michael Litvak
246635c426 test/test_view_build_status: fix wrong assert in test
The test expects and asserts that after wait_for_view is completed we
read the view_build_status table and get a row for each node and view.
But this is wrong because wait_for_view may have read the table on one
node, and then we query the table on a different node that didn't insert
all the rows yet, so the assert could fail.

To fix it we change the test to retry and check that eventually all
expected rows are found and then eventually removed on the same host.

Fixes scylladb/scylladb#22547

Closes scylladb/scylladb#22585

(cherry picked from commit 44c06ddfbb)

Closes scylladb/scylladb#22608
2025-02-03 09:24:17 +01:00
Michael Litvak
58eda6670f view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22607
2025-02-02 22:29:52 +02:00
Jenkins Promoter
28b8896680 Update pgo profiles - aarch64 2025-02-01 04:30:11 +02:00
Jenkins Promoter
e9cae4be17 Update pgo profiles - x86_64 2025-02-01 04:05:22 +02:00
Avi Kivity
daf1c96ad3 seatar: point submodule at scylla-seastar.git
This allows backporting commits to seastar.
2025-01-31 19:47:30 +02:00
Botond Dénes
1a1893078a Merge '[Backport 2025.1] encrypted_file_impl: Check for reads on or past actual file length in transform' from Scylladb[bot]
Fixes #22236

If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+&lt;1-15&gt;) (if crossing block boundary) and try to decrypt this buffer (last one).

Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform return `_file_size` - `read offset` -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero).

Check on last block in `transform` would wrap around size due to us being >= file size (l).
Just do an early bounds check and return zero if we're past the actual data limit.

- (cherry picked from commit e96cc52668)

- (cherry picked from commit 2fb95e4e2f)

Parent PR: #22395

Closes scylladb/scylladb#22583

* github.com:scylladb/scylladb:
  encrypted_file_test: Test reads beyond decrypted file length
  encrypted_file_impl: Check for reads on or past actual file length in transform
2025-01-31 11:38:50 +02:00
Aleksandra Martyniuk
8cc5566a3c api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.

(cherry picked from commit 18cc79176a)
2025-01-31 08:21:03 +00:00
Aleksandra Martyniuk
1f52ced2ff api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.

(cherry picked from commit e37d1bcb98)
2025-01-31 08:21:03 +00:00
Avi Kivity
d7e3ab2226 Merge '[Backport 2025.1] truncate: trigger truncate logic from a transition state instead of global topology request' from Ferenc Szili
This is a manual backport of #22452

Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in topology_coordinator::handle_global_request() while topology::tstate remains empty. This creates problems because topology::is_busy() uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing.

This change introduces a new topology transition topology::transition_state::truncate_table and moves the truncate logic to a new method topology_coordinator::handle_truncate_table(). This method is now called as a handler of the truncate_table transition state instead of a handler of the truncate_table global topology request.

Fixes #22552

Closes scylladb/scylladb#22557

* github.com:scylladb/scylladb:
  truncate: trigger truncate logic from transition state instead of global request handler
  truncate: add truncate_table transition state
2025-01-30 22:49:17 +02:00
Anna Stuchlik
cf589222a0 doc: update the Web Installer docs to remove OSS
Fixes https://github.com/scylladb/scylladb/issues/22292

Closes scylladb/scylladb#22433

(cherry picked from commit 2a6445343c)

Closes scylladb/scylladb#22581
2025-01-30 13:04:16 +02:00
Anna Stuchlik
156800a3dd doc: add SStable support in 2025.1
This commit adds the information about SStable version support in 2025.1
by replacing "2022.2" with "2022.2 and above".

In addition, this commit removes information about versions that are
no longer supported.

Fixes https://github.com/scylladb/scylladb/issues/22485

Closes scylladb/scylladb#22486

(cherry picked from commit caf598b118)

Closes scylladb/scylladb#22580
2025-01-30 13:03:47 +02:00
Nikos Dragazis
d1e8b02260 encrypted_file_test: Test reads beyond decrypted file length
Add a test to reproduce a bug in the read DMA API of
`encrypted_file_impl` (the file implementation for Encryption-at-Rest).

The test creates an encrypted file that contains padding, and then
attempts to read from an offset within the padding area. Although this
offset is invalid on the decrypted file, the `encrypted_file_impl` makes
no checks and proceeds with the decryption of padding data, which
eventually leads to bogus results.

Refs #22236.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8f936b2cbc)
(cherry picked from commit 2fb95e4e2f)
2025-01-30 09:17:31 +00:00
Calle Wilund
a51888694e encrypted_file_impl: Check for reads on or past actual file length in transform
Fixes #22236

If reading a file and not stopping on block bounds returned by `size()`, we could
allow reading from (_file_size+1-15) (block boundary) and try to decrypt this
buffer (last one).
Check on last block in `transform` would wrap around size due to us being >=
file size (l).

Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform return _file_size - read offset
-> wraparound -> rather larger number than we expected
(not to mention the data in question is junk/zero).

Just do an early bounds check and return zero if we're past the actual data limit.

v2:
* Moved check to a min expression instead
* Added lengthy comment
* Added unit test

v3:
* Fixed read_dma_bulk handling of short, unaligned read
* Added test for unaligned read

v4:
* Added another unaligned test case

(cherry picked from commit e96cc52668)
2025-01-30 09:17:31 +00:00
Botond Dénes
68f134ee23 Merge '[Backport 2025.1] Do not update topology on address change' from Scylladb[bot]
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated. The series factors out peers table update code from
sync_raft_topology_nodes() and calls it on topology and ip address
updates. As a side effect it fixes #22293 since now topology loading
does not require IP do be present, so the assert that is triggered in
this bug is removed.

Fixes: scylladb/scylladb#22293

- (cherry picked from commit ef929c5def)

- (cherry picked from commit fbfef6b28a)

Parent PR: #22519

Closes scylladb/scylladb#22543

* github.com:scylladb/scylladb:
  topology coordinator: do not update topology on address change
  topology coordinator: split out the peer table update functionality from raft state application
2025-01-30 11:14:19 +02:00
Jenkins Promoter
b623c237bc Update ScyllaDB version to: 2025.1.0-rc1 2025-01-30 01:25:18 +02:00
Calle Wilund
8379d545c5 docs: Remove configuration_encryptor
Fixes #21993

Removes configuration_encryptor mention from docs.
The tool itself (java) is not included in the main branch
java tools, thus need not remove from there. Only the words.

Closes scylladb/scylladb#22427

(cherry picked from commit bae5b44b97)

Closes scylladb/scylladb#22556
2025-01-29 20:17:36 +02:00
Michael Litvak
58d13d0daf cdc: fix handling of new generation during raft upgrade
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.

Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.

What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.

To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.

Fixes scylladb/scylladb#21227

Closes scylladb/scylladb#22093

(cherry picked from commit 4f5550d7f2)

Closes scylladb/scylladb#22546
2025-01-29 20:06:18 +02:00
Anna Stuchlik
4def507b1b doc: add OS support for 2025.1 and reorganize the page
This commit adds the OS support information for version 2025.1.
In addition, the OS support page is reorganized so that:
- The content is moved from the include page _common/os-support-info.rst
  to the regular os-support.rst page. The include page was necessary
  to document different support for OSS and Enterprise versions, so
  we don't need it anymore.
- I skipped the entries for versions that won't be supported when 2025.1
  is released: 6.1 and 2023.1.
- I moved the definition of "supported" to the end of the page for better
  readability.
- I've renamed the index entry to "OS Support" to be shorter on the left menu.

Fixes https://github.com/scylladb/scylladb/issues/22474

Closes scylladb/scylladb#22476

(cherry picked from commit 61c822715c)

Closes scylladb/scylladb#22538
2025-01-29 19:48:32 +02:00
Anna Stuchlik
69ad9350cc doc: remove Enterprise labels and directives
This PR removes the now redundant Enterprise labels and directives
from the ScyllDB documentation.

Fixes https://github.com/scylladb/scylladb/issues/22432

Closes scylladb/scylladb#22434

(cherry picked from commit b2a718547f)

Closes scylladb/scylladb#22539
2025-01-29 19:48:11 +02:00
Anna Stuchlik
29e5f5f54d doc: enable the FIPS note in the ScyllaDB docs
This commit removes the information about FIPS out of the '.. only:: enterprise' directive.
As a result, the information will now show in the doc in the ScyllaDB repo
(previously, the directive included the note in the Entrprise docs only).

Refs https://github.com/scylladb/scylla-enterprise/issues/5020

Closes scylladb/scylladb#22374

(cherry picked from commit 1d5ef3dddb)

Closes scylladb/scylladb#22550
2025-01-29 19:47:37 +02:00
Avi Kivity
379b3fa46c Merge '[Backport 2025.1] repair: handle no_such_keyspace in repair preparation phase' from null
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

Fixes: #22073.

Requires backport to 6.1 and 6.2 as they contain the bug

- (cherry picked from commit bfb1704afa)

- (cherry picked from commit 54e7f2819c)

Parent PR: #22473

Closes scylladb/scylladb#22542

* github.com:scylladb/scylladb:
  test: add test to check if repair handles no_such_keyspace
  repair: handle keyspace dropped
2025-01-29 14:09:23 +02:00
Ferenc Szili
fe869fd902 test: add reproducer and test for fix to split ready CG creation
This adds a reproducer for #22431

In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations which should exclude
with compaction group creation.

(cherry picked from commit 8bff7786a8)
2025-01-29 10:10:28 +00:00
Ferenc Szili
dc55a566fa table: run set_split_mode() on all storage groups during all_storage_groups_split()
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurance of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.

The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations which should exclude with
compaction group creation. i.e. DROP TABLE/DROP KEYSPACE

(cherry picked from commit 24e8d2a55c)
2025-01-29 10:10:28 +00:00
Ferenc Szili
3bb8039359 truncate: trigger truncate logic from transition state instead of global
request handler

Before this change, the logic of truncate for tablets was triggered from
topology_coordinator::handle_global_request(). This was done without
using a topology transition state which remained empty throughout the
truncate handler's execution.

This change moves the truncate logic to a new method
topology_coordinator::handle_truncate_table(). This method is now called
as a handler of the truncate_table topology transition state instead of
a handler of the trunacate_table global topology request.
2025-01-29 10:48:34 +01:00
Ferenc Szili
9f3838e614 truncate: add truncate_table transition state
Truncate table for tablets is implemented as a global topology operation.
However, it does not have a transition state associated with it, and
performs the truncate logic in handle_global_request() while
topology::tstate remains empty. This creates problems because
topology::is_busy() uses transition_state to determine if the topology
state machine is busy, and will return false even though a truncate
operation is ongoing.

This change adds a new transition state: truncate_table
2025-01-29 10:47:15 +01:00
Gleb Natapov
366212f997 topology coordinator: do not update topology on address change
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated, so call a function that does peers table update only.

(cherry picked from commit fbfef6b28a)
2025-01-28 21:51:11 +00:00
Gleb Natapov
c0637aff81 topology coordinator: split out the peer table update functionality from raft state application
Raft topology state application does two things: re-creates token metadata
and updates peers table if needed. The code for both task is intermixed
now. The patch separates it into separate functions. Will be needed in
the next patch.

(cherry picked from commit ef929c5def)
2025-01-28 21:51:11 +00:00
Aleksandra Martyniuk
dcf436eb84 test: add test to check if repair handles no_such_keyspace
(cherry picked from commit 54e7f2819c)
2025-01-28 21:50:35 +00:00
Aleksandra Martyniuk
8e754e9d41 repair: handle keyspace dropped
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

(cherry picked from commit bfb1704afa)
2025-01-28 21:50:35 +00:00
Yaron Kaikov
f407799f25 Update ScyllaDB version to: 2025.1.0-rc0 2025-01-27 11:29:45 +02:00
Kefu Chai
1151062b2a Update seastar submodule
* seastar a9bef537...18221366 (33):
  > io_queue: fix static member access to comply with CWG2813
  > build: add missing include in program_options.cc
  > coroutine: move operator co_await(exception) into seastar::coroutine namespace
  > fair_queue: Mark entry constructor explicit
  > test: Add perf test to measure the "cost" of chain wakeup
  > websocket: Support clients that do not specify subprotocol
  > websocket: Accept plain const& to string as subprotocol
  > perf_tests: Inline print_text_header() into stdout_printer
  > perf_tests: Right-align numeric metrics in markdown tables
  > scripts/addr2line.py: fix hanging with the new llvm-addr2line version
  > Revert "rpc stream: do not abort stream queue if stream connection was closed without error"
  > websocket: Convert connection::read_http_upgrade_request() to use coros
  > rpc stream: do not abort stream queue if stream connection was closed without error
  > linux-aio: remove cpu reduction suggestions
  > gitignore: ignore directories that match "build*"
  > perf_tests: make column generic
  > net: replace deprecated ip::address_v4::from_string()
  > file: remove deprecated file lifetime hint APIs
  > semaphore: expiry_handler: tunnel exception_ptr to entry
  > tests: unit: refactor expected_exception
  > semaphore: return early exception before appending wait_list
  > semaphore: expiry_handler: refactor exception getters
  > abortable_fifo: support OnAbort callbacks accepting exception_ptr
  > abort_on_expiry: fix typos in comments
  > abort_on_expiry: request_abort with timed_out_error
  > Add missing include in dpdk_rte.hh
  > build: use path to libraries in .pc
  > httpd: drop unnecessary dependencies from httpd.hh
  > build: allow CMake to find Boost using package config
  > print: remove deprecated print() functions
  > github: s/ubuntu-latest/ubuntu-24.04/
  > perf_tests: coroutinize main loop
  > add perf_tests_perf

Closes scylladb/scylladb#22466
2025-01-26 12:54:14 +02:00
Asias He
4018dc7f0d Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22467
2025-01-26 12:51:59 +02:00
Avi Kivity
f4b1ad43d4 gdb: protect debug::the_database from lto
Clang 18.1 with lto gained the ability to eliminate dead stores.
Since debug::the_database is write-only as far as the compiler understands
(it is read only by gdb), all writes to it are eliminated.

Protect writes to the variable by marking it volatile.

Closes scylladb/scylladb#22454
2025-01-23 22:26:04 +02:00
Pavel Emelyanov
eee3681d86 Merge 'tree: restore header compilation (${mode}-headers)' from Botond Dénes
Our CI accidentally switched to using CMake to compile scylla and it looks like CMake doesn't run the `${mode}-headers` command correctly and some missing-include in headers managed to slip in.

Compile fix, no backport needed.

Closes scylladb/scylladb#22471

* github.com:scylladb/scylladb:
  test/raft/replication.hh: add missing include <fmt/std.h>
  test/boost/bptree_validation.hh: add missing include <fmt/format.h>
2025-01-23 15:35:55 +03:00
Botond Dénes
e038473887 test/raft/replication.hh: add missing include <fmt/std.h> 2025-01-23 07:29:01 -05:00
Botond Dénes
e60e575cb0 test/boost/bptree_validation.hh: add missing include <fmt/format.h> 2025-01-23 06:05:57 -05:00
Avi Kivity
0092bb5831 Merge 'main: rename cql_sg_stats metrics on scheduling group rename' from Piotr Dulikowski
This PR contains the missing part of a fix for scylladb/scylla-enterprise#4912 which was omitted during migration of workload prioritization to the source available repository. Even though the regression test for it was ported, it was silently made ineffective by a different fix (scylladb/scylla-enterprise#4764), so this PR also improves the test.

Fixes: scylladb/scylladb#22404

No need to backport - service levels are not yet a part of any source-available release.

Closes scylladb/scylladb#22416

* github.com:scylladb/scylladb:
  test/auth_cluster: make test_service_level_metric_name_change useful
  main: rename `cql_sg_stats` metrics on scheduling group rename
2025-01-22 14:22:09 +02:00
Avi Kivity
59d3a66d18 Revert "Introduce file stream for tablet"
This reverts commit 8208688178. It was
contributed from enterprise, but is too different from the original
for me to merge back.
2025-01-22 09:42:20 +02:00
Benny Halevy
23284f038f table: flush: synchronize with stop()
When the table is stopped, all compaction groups
are stopped, and as part of that, they are flushing
their memtables.

To synchronize with stop-induced flush operation,
move _pending_flushes_phaser.stop() later in table::stop(),
after all compaction groups are flushed and stopped.

This way, in table::flush, if we see that the phaser
is already closed, we know that there is nothing to flush,
otherwise we start a flush operation that would be waited
on by a parallel table::stop().

Fixes #22243

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22339
2025-01-22 09:23:09 +02:00
Nadav Har'El
a8805c4fc1 Merge 'cql3, test, utils: switch from boost::adaptors::uniqued to utils::views:unique ' from Kefu Chai
In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#22393

* github.com:scylladb/scylladb:
  cql3, test: switch from boost::adaptors::uniqued to utils::views:unique
  utils: implement drop-in replacement for replacing boost::adaptors::uniqued
2025-01-21 19:06:21 +02:00
Sergey Zolotukhin
38caabe3ef test: Fix inconsistent naming of the log files.
The log file names created in `scylla_cluster.py` by
`ScyllaClusterManager`
and files to be collected in conftest.py by `manager` should be in
sync. This patch fixes the issue, originally introduced in
scylladb/scylladb#22192

Fixes scylladb/scylladb#22387

Backports: 6.1 and 6.2.

Closes scylladb/scylladb#22415
2025-01-21 10:45:17 +02:00
Kefu Chai
ccb7b4e606 cql3, test: switch from boost::adaptors::uniqued to utils::views:unique
In order to reduce the dependency on external libraries, and for better
integration with ranges in C++ standard library. let's use the homebrew
`utils::views::unique()` before unique is accepted by the C++ standard.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-21 16:24:45 +08:00
Kefu Chai
d5d251da9a utils: implement drop-in replacement for replacing boost::adaptors::uniqued
Add a custom implementation of boost::adaptors::uniqued that is compatible
with C++20 ranges library. This bridges the gap between Boost.Range and
the C++ standard library ranges until std::views::unique becomes available
in C++26. Currently, the unique view is included in
[P2214](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2760r0.html)
"A Plan for C++ Ranges Evolution", which targets C++26.

The implementation provides:
- A lazy view adaptor that presents unique consecutive elements
- No modification of source range
- Compatibility with C++20 range views and concepts
- Lighter header dependencies compared to Boost

This resolves compilation errors when piping C++20 range views to
boost::adaptors::uniqued, which fails due to concept requirements
mismatch. For example:
```c++
    auto range = std::views::take(n) | boost::adaptors::uniqued; // fails
```
This change also offers us a lightweight solution in terms of smaller
header dependency.

While std::ranges::unique exists in C++23, it's an eager algorithm that
modifies the source range in-place, unlike boost::adaptors::uniqued which
is a lazy view. The proposed std::views::unique (P2214) targeting C++26
would provide this functionality, but is not yet available.

This implementation serves as an interim solution for filtering consecutive
duplicate elements using range views until std::views::unique is
standardized.

For more details on the differences between `std::ranges::unique` and
`boost::adaptors::uniqued`:

- boost::adaptors::uniqued is a view adaptor that creates a lazy view over the original range. It:
  * Doesn't modify the source range
  * Returns a view that presents unique consecutive elements
  * Is non-destructive and lazy-evaluated
  * Can be composed with other views
- std::ranges::unique is an algorithm that:
  * Modifies the source range in-place
  * Removes consecutive duplicates by shifting elements
  * Returns an iterator to the new logical end
  * Cannot be used as a view or composed with other range adaptors

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-21 16:24:45 +08:00
Tomasz Grabiec
8059090a29 Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

In this series we also make sure that the view schemas in the registry are
kept up-to-date in regards to base schema changes.

Fixes https://github.com/scylladb/scylladb/issues/21354

This issue is a bug, so adding backport labels 6.1 and 6.2

Closes scylladb/scylladb#21862

* github.com:scylladb/scylladb:
  test: add test for schema registry maintaining base info for views
  schema_registry: avoid setting base info when getting the schema from registry
  schema_registry: update cached base schemas when updating a view
  schema_registry: cache base schemas for views
  db: set base info before adding schema to registry
2025-01-21 00:17:54 +01:00
Nadav Har'El
3e16b80014 Merge 'Reject create table with compact storage' from Benny Halevy
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.

Yet, there's is nothing in the code that prevents users
from creating such tables.

This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage`, set to
`false` by default, that require users to opt-in
in order to create new tables WITH COMPACT STORAGE.

Refs scylladb/scylladb#12263, scylladb/scylladb#16375

* Since this guardrail is an enhancement, no backport is needed

Closes scylladb/scylladb#16403

* github.com:scylladb/scylladb:
  docs: ddl: document the deprecation of compact tables
  test: enable_create_table_with_compact_storage for tests that need it
  config: add enable_create_table_with_compact_storage
2025-01-20 22:02:02 +02:00
Piotr Dulikowski
780ff17ff5 test/auth_cluster: make test_service_level_metric_name_change useful
The test test_service_level_metric_name_change was originally introduced
to serve as a regression test for scylladb/scylla-enterprise#4912.
Before the fix, some per-scheduling-group metrics would not get adjusted
when the scheduling group gets renamed (which does happen for SL-managed
scheduling groups) and it would be possible to attempt to register
metrics with the same set of labels, resulting in an error.

However, in scylladb/scylla-enterprise#4764, another bug was fixed which
affected the test. Before a service level is created, a "test"
scheduling group can be created by service level controller if it is
unsure whether it is allowed to create more scheduling groups or not. If
creation of the scheduling group succeeds, it is put into the pool of
scheduling groups to be reused when a new service level is created.
Therefore, the node handling CREATE SERVICE LEVEL would always use the
scheduling group that was originally created for the sake of the test as
a SG for the new service level.

All of the above is intentional and was actually fixed by the
aforementioned issue. However, the test scheduling groups would always
get unique names and, therefore, the error would no longer reproduce.
However, the faulty logic that ran previously and caused the bug still
runs - when a node updates its service levels cache on group0 reload.

The test previously used only one node. Fix it by starting two nodes
instead of one at the beginning of the test and by serving all service
level commands to the first node - were the issue not fixed, the error
would get triggered on the second node.
2025-01-20 18:17:15 +01:00
Piotr Dulikowski
de153a2ba7 main: rename cql_sg_stats metrics on scheduling group rename
This commit contains the part of a fix
for scylladb/scylla-enterprise#4912 that was accidentally omitted when
workload prioritization were ported from enterprise to scylladb.git
repo. Without it, the metrics created by `cql_sg_stats` would not be
updated, leading to wrong scheduling group names being used in metrics'
names, and could lead to "double metric registration errors" in some
unlucky circumstances where a scheduling group would be created,
destroyed and then created again.

Fixes: scylladb/scylladb#22404
2025-01-20 18:16:46 +01:00
Tomasz Grabiec
c7f78edc78 Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.

Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.

Fixes #17507

New feature. No backport is needed.

Closes scylladb/scylladb#21896

* github.com:scylladb/scylladb:
  repair: Stop using rpc to update repair time for repairs scheduled by scheduler
  repair: Wire repair_time in system.tablets for tombstone gc
  test: Disable flush_cache_time for two tablet repair tests
  test: Introduce guarantee_repair_time_next_second helper
  repair: Return repair time for repair_service::repair_tablet
  service: Add tablet_operation.hh
2025-01-20 18:08:49 +01:00
Benny Halevy
88ae067ddb everywhere: add skeletal support for the in_memory_tables feature
Forward-ported from scylla-enterprise.
Note that the feature has been deprecated and the implementation
is provided only for backward compatibility with pre-existing
features and schema.

Tested manually after adding the following to feature_service:
```
    gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
```

Launched a single-node cluster running 2023.1.10
```
cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'};
```

log:
```
Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ...
...
INFO  2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440
```

Upgraded to this branch and started scylla.
Verified that ks.test was successfuly loaded:

log:
```
INFO  2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ...
...
WARN  2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy.
...
INFO  2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8
```

Then, tested:
```
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'InMemoryCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'};
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
```

log:
```
INFO  2024-12-27 19:56:40,465 [shard 0:stmt] migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22068
2025-01-20 16:55:17 +02:00
Kefu Chai
9d6ec45730 build: support wasm32-wasip1 target in configure.py
Update configure.py to use wasm32-wasip1 as an alternative to wasm32-wasi,
matching the behavior previously implemented for CMake builds in 8d7786cb0e.
This ensures consistent WASI target handling across both build systems.

Refs #20878

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22386
2025-01-20 16:43:22 +02:00
Asias He
8208688178 Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22034
2025-01-20 16:43:21 +02:00
Yaniv Michael Kaul
7495237a33 Remove noexcept_traits.hh header file
The content of the header file noexcept_traits.hh is unused throughout ScyllaDB's code base.
As part of a greater effort to cleanup Scylla's code and reduce content in the root directory, this header file is simply removed.

This is code cleanup - no need to backport.

Fixes: https://github.com/scylladb/scylladb/issues/22117
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#22139
2025-01-20 16:43:21 +02:00
Nadav Har'El
8caea23d2a test/cqlpy/run: fix regression in "--release" option
The way that the "test/cqlpy/run --release" feature runs older Scylla
releases is that it takes *today*'s command line parameters and "fixes"
it to conform to what old releases took. This approach was easy to
implement (and the resulting "--release" feature is super useful), but
the downside is that we need to update this fixup code whenever we add
new options to the Scylla command line used by test/cqlpy/run.py.

Commit d04f376 made test/cqlpy/run.py use a new option
"--experimental-features=views-with-tablets", so now we need to remove
it when running older versions of Scylla. So this is what we do in this
patch.

Fixes #22349

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22350
2025-01-20 16:43:21 +02:00
Nadav Har'El
12cbdfa095 test/cqlpy: add regression test for tombstone_gc in "desc table"
The small cqlpy test in this patch is a regression test for issue #14390,
which claimed that the Scylla-only "tombstone_gc" option is missing from
the output of "describe table".

This test shows that this report is *not* true, at least not when the
"server-side describe" is used. "test/cqlpy/run --release ..." shows
that this test passes on master and also for Scylla versions all the
way back to Scylla 5.2 (Scylla 5.1 did not support server-side
describe, so the test fails for that reason).

This suggests that the report in issue #14390 was for old-style
client-side (cqlsh) describe, which we no longer support, so this
issue can be closed.

Fixes #14390.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22354
2025-01-20 16:43:21 +02:00
Avi Kivity
d2869ecb2b partition_range_compat: drop dependency on boost ranges
Unused anyway.

Closes scylladb/scylladb#22359
2025-01-20 16:43:21 +02:00
Anna Stuchlik
e340d6a452 doc: remove Open Source references in the docs
Fixes https://github.com/scylladb/scylladb/issues/22325

Closes scylladb/scylladb#22377
2025-01-20 16:43:21 +02:00
Botond Dénes
1f20f7810e Merge 'main, encryption: correct misspellings' from Kefu Chai
in this changeset, some misspellings identified by codespell were corrected.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#22301

* github.com:scylladb/scylladb:
  ent/encryption: rename "sie" to "get_opt"
  ent,main: fix misspellings
2025-01-20 16:43:21 +02:00
Benny Halevy
5c77956205 docs: ddl: document the deprecation of compact tables
Add a paragraph documenting the decision to deprecate
the COMPACT STORAGE feature, and instruct the user
how to enable the feature despite that.

Note that we don't have an official migration strategy
for users like `DROP COMPACT STORAGE`, which is not
implemented at this time (See #3882).

Fixes #16375

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-20 08:14:39 +02:00
Benny Halevy
f3ab00e61c test: enable_create_table_with_compact_storage for tests that need it
Now enable_create_table_with_compact_storage can be set
to `false` by default in db/config.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-20 08:14:37 +02:00
Benny Halevy
0110eb0506 config: add enable_create_table_with_compact_storage
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.

Yet, there's is nothing in the code that prevents users
from creating such tables.

This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage` that require users
to opt-in in order to create new tables WITH COMPACT STORAGE.

The option is currently set to `true` by default in db/config
to reduce the churn to tests and to `false` in scylla.yaml,
for new clusters.

TODO: once regressions tests that use compact storage
are converted to enable the option, change the default in
db/config to false.

A unit test was added to test/cql-pytest that
checks that the respective cql query fails as expected
with the default option or when it is explicitly set to `false`,
and that the query succeeds when the option is set to `true`.

Note that `check_restricted_table_properties` already
returns an optional warning, but it is only logged
but not returned in the `prepared_statement`.
Fixing that is out of the scope of this patch.
See https://github.com/scylladb/scylladb/issues/20945

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-20 08:03:25 +02:00
Kefu Chai
1ef2d9d076 tree: migrate from boost::adaptors::transformed to std::views::transform
Replace remaining uses of boost::adaptors::transformed with std::views::transform
to reduce Boost dependencies, following the migration pattern established in
bab12e3a. This change addresses recently merged code that reintroduced Boost
header dependencies through boost::adaptors::transformed usage.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22365
2025-01-17 16:56:40 +02:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks - finished tasks are not
presented with the task manager API.

A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check spilt and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: retrun status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Piotr Dulikowski
6aa962f5f4 Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:

Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management

Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup

I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.

Notably, there should be no non-whitespace changes between this and scylla-enterprise

Fixes scylladb/scylla-enterprise#4999

Closes scylladb/scylladb#22147

* github.com:scylladb/scylladb:
  audit: Add shares support to service level management
  audit: Add service level support to CQL login process
  audit: Add support to CQL statements
  audit: Integrate audit subsystem into Scylla main process
  audit: Add documentation for the audit subsystem
  audit: Add the audit subsystem
2025-01-17 13:14:55 +01:00
Kamil Braun
89ee2a6834 Merge 'drop ip addresses from token metadata' from Gleb
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.

Fixes: scylladb/scylla#21777

* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
  hint manager: do not translate ip to id in case hint manager is stopped already
  locator: token_metadata: drop update_host_id() function that does nothing now
  locator: topology: drop indexing by ips
  repair: drop unneeded code
  storage_service: use host_id to look for a node in on_alive handler
  storage_proxy: translate ips to ids in forward array using gossiper
  locator: topology: remove unused functions
  storage_service: check for outdated ip in on_change notification in the peers table
  storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
  topology coordinator: change connection dropping code to work on host ids
  cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
  locator: drop unused function from tablet_effective_replication_map
  api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
  locator: token_metadata: remove unused ip based functions
  locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
  gossiper: drop get_unreachable_token_owners functions
  storage_service: use gossiper to map ip to id in node_ops operations
  storage_service: fix indentation after the last patch
  storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
  token_metadata: drop no longer used functions
  ...
2025-01-17 11:00:52 +01:00
Kefu Chai
4a5a00347f utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22201
2025-01-17 11:24:54 +03:00
Botond Dénes
55963f8f79 replica: remove noexcept from token -> tablet resolution path
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.

Fixes: #21480

Closes scylladb/scylladb#22251
2025-01-17 11:24:09 +03:00
Łukasz Paszkowski
adef719c43 api/storage_service: Remove unimplemented truncate API
The API /storage_service/truncate/{ks} returns an unimplemented
error when invoked. As we already have a CQL command,
`TRUNCATE TABLE ks.cf` that causes the table to be truncated on all
nodes, the API can be dropped. Due to the error, it is unused.

Fixes https://github.com/scylladb/scylladb/issues/10520

No backport is required. A small cleanup of not working API.

Closes scylladb/scylladb#22258
2025-01-17 11:21:05 +03:00
Pavel Emelyanov
14c3fbbf8c Merge 'sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan Sreethar
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

This fixes a regression and should be backported.

Closes scylladb/scylladb#22263

* github.com:scylladb/scylladb:
  sstable_directory: do not load remote sstables in process_descriptor
  sstable_directory: update `load_sstable()` definition
  sstable_directory: reintroduce `get_shards_for_this_sstable()`
2025-01-17 11:17:54 +03:00
Asias He
387b2050df repair: Stop using rpc to update repair time for repairs scheduled by scheduler
If a tablet repair is scheduled by tablet repair scheduler, the repair
time for tombstone gc will be updated when the system.tablet.repair_time
is updated. Skip updating using rpc calls in this case.
2025-01-17 16:12:05 +08:00
Asias He
53e6025aa6 repair: Wire repair_time in system.tablets for tombstone gc
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.

Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.

Fixes #17507
2025-01-17 16:12:05 +08:00
Asias He
0b2fef74bc test: Disable flush_cache_time for two tablet repair tests
The cache of the hints and batchlog flush makes the exact repair time
check difficult in the test. Disabling it for two repair tests
that check the exact repair time.
2025-01-17 16:12:05 +08:00
Asias He
23afbd938c test: Introduce guarantee_repair_time_next_second helper
The repair time granularity is seconds. This helper makes sure the
repair time is different than the previous one.
2025-01-17 16:12:05 +08:00
Asias He
41a1eca072 repair: Return repair time for repair_service::repair_tablet
The repair time returned by repair_service::repair_tablet considers the
hints and batchlog flush time, so it could be used for the tombstone gc
purpose.
2025-01-17 16:12:05 +08:00
Asias He
614c3380c6 service: Add tablet_operation.hh
A tablet_operation_result struct is added to track the result of a
tablet operation.
2025-01-17 16:12:05 +08:00
Avi Kivity
d6f7f873d0 utils: config_file: don't use extern fully specialized variable templates
Declaring-but-not-defining a fully specialized template is a great way to
cut dependencies between users and providers, but unfortunately not
supported for variable templates. Clang 18 does support it, but
apparently it is a misinterpretation of the standard, and was removed
in clang 19.

We started using this non-feature in 7ed89266b3.

The fix is to use function templates. This is more verbose as
each specialization needs to define a static variable to return,
but is fully supported.

Closes scylladb/scylladb#22299
2025-01-17 11:06:50 +03:00
Botond Dénes
2428f22d3e Update tools/python3 submodule
* tools/python3 fbf12d02...8415caf4 (1):
  > dist: Support FIPS mode
2025-01-17 09:17:29 +02:00
Tzach Livyatan
a00ab65491 remove BETA from metric and API reference
Closes scylladb/scylladb#22092
2025-01-16 19:25:51 -05:00
Łukasz Paszkowski
aad46bd6f3 reader_concurrency_semaphore: do_wait_admission(): remove dumping diagnostics
The commit b39ca29b3c introduced detection of admission-waiter
anomaly and dumps permit diagnostics as soon as the semaphore did
not admit readers even though it could.

Later on, the commit bf3d0b3543 introduces the optimization where
the admission check is moved to the fiber processing the _read_list.
Since the semaphore no longer admits readers as soon as it can,
dumping diagnostic errors is not necessary as the situation is not
abnormal.

Closes scylladb/scylladb#22344
2025-01-16 19:23:43 -05:00
Nadav Har'El
955ac1b7b7 test/alternator: close boto3 client before shutting down
For several years now, we have seen a strange, and very rare, flakiness
in Alternator tests described in issue #17564: We see all the test pass,
pytest declares them to have passed, and while Python is existing, it
crashes with a signal 11 (SIGSEGV). Because this happens exclusively in
test/alternator and never in the test/cqlpy, we suspect that something
that the test/alternator leaves behind but test/cqlpy does not, causes
some race and crashes during shutdown.

The immediate suspect is the boto3 library, or rather, the urllib3 library
which it uses. This is more-or-less the only thing that test/alternator
does which test/cqlpy doesn't. The urllib3 library keeps around pools of
reusable connections, and it's possible (although I don't actually have any
proof for it) that these open connections may cause a crash during shutdown.

So in this patch I add to the "dynamodb" and "dynamodbstreams" fixtures
(which all Alternator tests use to connect to the server), a teardown which
calls close() for the boto3 client object. This close() call percolates
down to calling clear() on urllib3's PoolManager. Hopefully, this will
make some difference in the chance to crash during shutdown - and if it
doesn't, it won't hurt.

Refs #17564

Closes scylladb/scylladb#22341
2025-01-16 19:21:00 -05:00
Gleb Natapov
a40e810442 hint manager: do not translate ip to id in case hint manager is stopped already
Since we do not stop storage proxy on shutdown this code can be called
during shutdown when address map is no longer usable.
2025-01-16 16:37:08 +02:00
Gleb Natapov
1e4b2f25dc locator: token_metadata: drop update_host_id() function that does nothing now 2025-01-16 16:37:08 +02:00
Gleb Natapov
50fb22c8f9 locator: topology: drop indexing by ips
Do not track id to ip mapping in the topology class any longer. There
are no remaining users.
2025-01-16 16:37:08 +02:00
Gleb Natapov
f9df092fd1 repair: drop unneeded code
There is a code that creates a map from id to ip and then creates a
vector from the keys of the map. Create a vector directly instead.
2025-01-16 16:37:08 +02:00
Gleb Natapov
12da203cae storage_service: use host_id to look for a node in on_alive handler 2025-01-16 16:37:08 +02:00
Gleb Natapov
d45ce6fa12 storage_proxy: translate ips to ids in forward array using gossiper
We already use it to translate reply_to, so do it for consistency and to
drop ip based API usage.
2025-01-16 16:37:08 +02:00
Gleb Natapov
db73758655 locator: topology: remove unused functions 2025-01-16 16:37:07 +02:00
Gleb Natapov
fb28ff5176 storage_service: check for outdated ip in on_change notification in the peers table
The code checks that it does not run for an ip address that is no longer
in use (after ip address change). To check that we can use peers table
and see if the host id is mapped to the address. If yes, this is the
latest address for this host id otherwise this is an outdated entry.
2025-01-16 16:37:07 +02:00
Gleb Natapov
163099678e storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
We want to drop ip from the locator::node.
2025-01-16 16:37:07 +02:00
Gleb Natapov
49fa1130ef topology coordinator: change connection dropping code to work on host ids
Do not use ip from topology::node, but look it up in address map
instead. We want to drop ip from the topology::node.
2025-01-16 16:37:07 +02:00
Gleb Natapov
83d15b8e32 cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
We want to drop ip from the topology::node.
2025-01-16 16:37:07 +02:00
Gleb Natapov
5cd3627baa locator: drop unused function from tablet_effective_replication_map 2025-01-16 16:37:07 +02:00
Gleb Natapov
122d58b4ad api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead 2025-01-16 16:37:07 +02:00
Gleb Natapov
97f95f1dbd locator: token_metadata: remove unused ip based functions 2025-01-16 16:37:07 +02:00
Gleb Natapov
3068e38baa locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs 2025-01-16 16:37:07 +02:00
Gleb Natapov
0ec9f7de64 gossiper: drop get_unreachable_token_owners functions
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.
2025-01-16 16:37:07 +02:00
Gleb Natapov
a7a7cdcf42 storage_service: use gossiper to map ip to id in node_ops operations
Replace operation is special though. In case of replacing with the same
IP the gossiper will not have the mapping, and node_ops RPC
unfortunately does not send host id of a replaced node. For replace we
consult peers table instead to find the old owner of the IP. A node that
is replacing (the coordinator of the replace) will not have it though,
but luckily it is not needed since it updates metadata during
join_topology() anyway. The only thing that is missing there is
add_replacing_endpoint() call which the patch adds.
2025-01-16 16:37:07 +02:00
Gleb Natapov
0db6136fa5 storage_service: fix indentation after the last patch 2025-01-16 16:37:07 +02:00
Gleb Natapov
9197b88e48 storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
The call already throw an error if there are more than one. Throw is
there are zero as well and drop the loops.
2025-01-16 16:37:07 +02:00
Gleb Natapov
fcfd005023 token_metadata: drop no longer used functions 2025-01-16 16:37:07 +02:00
Gleb Natapov
7c4c485651 host_id_or_endpoint: use gossiper to resolve ip to id and back mappings
host_id_or_endpoint is a helper class that hold either id or ip and
translate one into another on demand. Use gossiper to do a translation
there instead of token_metadata since we want to drop ip based APIs from
the later.
2025-01-16 16:37:07 +02:00
Gleb Natapov
70cc014307 storage_service: ip_address_updater: check peers table instead of token_metadata whether ip was changed
As part of changing IP address peers table is updated. If it has a new
address the update can be skipped.
2025-01-16 16:37:07 +02:00
Gleb Natapov
8e55cc6c78 storage_service: fix logging
When logger outputs a range it already does join, so no other join is
needed.
2025-01-16 16:37:07 +02:00
Gleb Natapov
7556e3d045 topology coordinator: remove gossiper entry only if host id matches provided one
Currently the entry is removed only if ip is not used by any normal or
transitioning node. This is done to not remove a wrong entry that just
happen to use the same ip, but the same can be achieved by checking host
id in the entry.
2025-01-16 16:37:07 +02:00
Gleb Natapov
593308a051 node_ops, cdc: drop remaining token_metadata::get_endpoint_for_host_id() usage
Use address map to translate id to ip instead. We want to drop ips from token_metadata.
2025-01-16 16:37:07 +02:00
Gleb Natapov
ae8dc595e1 hints: move id to ip translation into store_hint() function
Also use gossiper to translate instead of token_metadata since we want
to get rid of ip base APIs there.
2025-01-16 16:37:06 +02:00
Gleb Natapov
c7d08fe1fe storage_service: change get_dc_rack_for() to work on host ids 2025-01-16 16:37:06 +02:00
Gleb Natapov
415e8de36e locator: topology: change get_datacenter_endpoints and get_datacenter_racks to return host ids and amend users 2025-01-16 16:37:06 +02:00
Gleb Natapov
8a0fea5fef locator: topology: drop is_me ip overload along with remaning users 2025-01-16 16:37:06 +02:00
Gleb Natapov
2ea8df2cf5 storage_proxy: drop is_alive that works on ip since it is not used any more 2025-01-16 16:37:06 +02:00
Gleb Natapov
8433947932 locator: topology: remove get_location overload that works on ip and its last users 2025-01-16 16:37:06 +02:00
Gleb Natapov
25eb98ecbc locator: topology: drop no longer used ip based overloads 2025-01-16 16:37:06 +02:00
Gleb Natapov
315db647dd consistency_level: drop templates since the same types of ranges are used by all the callers 2025-01-16 16:37:06 +02:00
Gleb Natapov
1b6e1456e5 messaging_service: drop the usage of ip based token_metadata APIs
We want to drop ips from token_metadata so move to use host id based
counterparts. Messaging service gets a function that maps from ips to id
when is starts listening.
2025-01-16 16:37:06 +02:00
Gleb Natapov
da9b7b2626 storage_service: drop ip based topology::get_datacenter() usage
We want to drop ips from the topology eventually.
2025-01-16 16:37:06 +02:00
Gleb Natapov
36ccc897e8 gossiper: change get_live_members and all its users to work on host ids 2025-01-16 16:37:06 +02:00
Gleb Natapov
7a3237c687 messaging_service: drop get_raw_version and knows_version
The are unused. The version is always fixed.
2025-01-16 16:37:06 +02:00
Gleb Natapov
8cc09f4358 storage_service: do not use ip addresses from token_metadata in handling of a normal state
Instead use gossiper and peers table to retrieve same information.
Token_metadata is created from the mix of those two anyway. The goal is
to drop ips from token_metadata entirely.
2025-01-16 16:37:06 +02:00
Gleb Natapov
5262bbafff locator: drop no longer used ip based functions 2025-01-16 16:37:06 +02:00
Gleb Natapov
542360e825 test: drop inet_address usage from network_topology_strategy_test
Move the test to work on host ids. IPs will be dropped eventually.
2025-01-16 16:37:06 +02:00
Gleb Natapov
9ea53a8656 storage_service: move describe ring and get_range_to_endpoint_map to use host ids inside and translate to ips at the last moment
The functions are called from RESful API so has to return ips for backwards
compatibility, but internally we can use host ids as long as possible
and convert to ips just before returning. This also drops usage of ip
based erm function which we want to get rid of.
2025-01-16 16:37:06 +02:00
Gleb Natapov
f03a575f3d storage_service: move storage_service::get_natural_endpoints to use host ids internally and translate to ips before returning
The function is called by RESful API so has to return ips for backwards
compatibility, but internally we can use host ids as long as possible
and convert to ips just before returning. This also drops usage of ip
based erm function which we want to get rid of.
2025-01-16 16:37:06 +02:00
Gleb Natapov
6e6b2cfa63 storage_service: use existing util function instead of re-iplementing it
locator/util.hh already has get_range_to_address_map which is exactly
like the one in the storage_service. So remove the later one and use the
former instead.
2025-01-16 16:37:06 +02:00
Gleb Natapov
58f8395bc2 storage_service: use gossiper instead of token_metadata to map ip to id in gossiper notifications
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-16 16:35:13 +02:00
Michał Chojnowski
16b3352ae7 build: fix -ffile-prefix-map
cmake doesn't set a `-ffile-prefix-map` for source files. Among other things,
this results in absolute paths in Scylla logs:

```
Jan 11 09:59:11.462214 longevity-tls-50gb-3d-master-db-node-2dcd4a4a-5 scylla[16339]: scylla: /jenkins/workspace/scylla-master/next/scylla/utils/refcounted.hh:23: utils::refcounted::~refcounted(): Assertion `_count == 0' failed.
```

And it results in absolute paths in gdb, which makes it a hassle to get gdb to display
source code during debugging. (A build-specific `substitute-path` has to be
configured for that).

There is a `-file-prefix-map` rule for `CMAKE_BINARY_DIR`,
but it's wrong.
Patch dbb056f4f7, which added it,
was misguided.

What we want is to strip the leading components of paths up to
the repository directory, both in __FILE__ macros and in debug info.
For example, we want to convert /home/michal/scylla/replica/table.cc to
replica/table.cc or ./replica/table.cc, both in Scylla logs and in gdb.

What the current rule does is it maps `/home/michal/scylla/build` to `.`,
which is wrong: it doesn't do anything about the paths outside of `build`,
which are the ones we actually care about.

This patch fixes the problem.

Closes scylladb/scylladb#22311
2025-01-16 16:35:18 +03:00
Kefu Chai
8d7786cb0e build: cmake: use wasm32-wasip1 as an alternative of wasm32-wasi
wasm32-wasi has been removed in Rust 1.84 (Jan 5th, 2025). if one
compiles the tree with Rust 1.84 or up, following build failure is
expected:

```
[2/305] Building WASM /home/kefu/dev/scylladb/build/wasm/return_input.wasm
FAILED: wasm/return_input.wasm /home/kefu/dev/scylladb/build/wasm/return_input.wasm
cd /home/kefu/dev/scylladb/test/resource/wasm/rust && /usr/bin/cargo build --target=wasm32-wasi --example=return_input --locked --manifest-path=Cargo.toml --target-dir=/home/kefu/dev/scylladb/build/test/resource/wasm/rust && wasm-opt /home/kefu/dev/scylladb/build/test/resource/wasm/rust/wasm32-wasi//debug/examples/return_input.wasm -Oz -o /home/kefu/dev/scylladb/build/wasm/return_input.wasm && wasm-strip /home/kefu/dev/scylladb/build/wasm/return_input.wasm
error: failed to run `rustc` to learn about target-specific information

Caused by:
  process didn't exit successfully: `rustc - --crate-name ___ --print=file-names --target wasm32-wasi --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=split-debuginfo --print=crate-name --print=cfg` (exit status: 1)
  --- stderr
  error: Error loading target specification: Could not find specification for target "wasm32-wasi". Run `rustc --print target-list` for a list of built-in targets
```

in order to workaround this issue, let's check for supported target,
and use wasm32-wasip1 if wasm32-wasi is not listed as the supported
target.

Refs #20878

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22320
2025-01-16 16:28:29 +03:00
Michał Chojnowski
38d94475f2 messaging_service: fix the piece of code which clears clients on shutdown()
While this isn't strictly needed for anything, messaging_service
is supposed to clear its RPC connection objects on stop,
for debuggability reasons.

But a recent change in this area broke that.
std::bind creates copies of its arguments, so the `m.clear()`
statement in stop_client() only clears a copy of the vector of shared pointers,
instead of clearing the original vector. This patch fixes that.

Fixes #22245

Closes scylladb/scylladb#22333
2025-01-16 16:26:18 +03:00
Andrei Chekun
29a69f495e test.py: Mark the cluster dirty after each test for topology
Currently, tests are reusing the cluster. This leads to the situation
when test passes and leaves the cluster broken, that the next tests will
try to clean up the Scylla working directory during starting the node.
Timeout for starting is set to two minutes by default and sometimes
cleaning the mess after several tests can take more time, so tests fails
during adding the node to the cluster. Current PR marks the cluster
dirty after the test, so no need to clean the Scylla working directory.
The disadvantage of this way is increasing the time for tests execution.
Observable increase is approximately one minutes for one repeat in dev
mode: 22 min 35s vs. 23 min 41s.

Closes scylladb/scylladb#22274
2025-01-16 13:51:18 +01:00
Botond Dénes
b2a03e03f7 Merge 'raft: Handle non-critical config update errors in when changing voter status.' from Sergey Zolotukhin
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.

Closes scylladb/scylladb#22253

* github.com:scylladb/scylladb:
  raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
  raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
  raft: Handle non-critical config update errors in when changing status to voter.
  test: Add test to check that a node does not fail on unknown commit status error when starting up.
  raft: Add run_op_with_retry in raft_group0.
2025-01-16 11:00:47 +02:00
Yaron Kaikov
f15bf8a245 Update ScyllaDB version to: 2025.1.0-dev
Following the license changes in f3eade2f62

Closes scylladb/scylladb#21978
2025-01-16 07:01:37 +02:00
Gleb Natapov
0c930199f8 storage_service: use gossiper to map id to ip instead of token_metadata in node_ops_cmd_handler
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-15 16:30:29 +02:00
Gleb Natapov
5d4d9fd31d storage_service: force_remove_completion use address map to resolve id to ip instead of token metadata
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-15 16:30:29 +02:00
Gleb Natapov
f5fa4d9742 topology coordinator: drop get_endpoint_for_host_id_if_known usage
Now that we have gossiper::get_endpoint_state_ptr that works on host ids
there is no need to translate id to ip at all.
2025-01-15 16:30:29 +02:00
Gleb Natapov
b3f8b579c0 gossiper: add get_endpoint_state_ptr() function that works on host id
Will be used later to simplify code.
2025-01-15 16:30:29 +02:00
Gleb Natapov
448282dc93 storage_proxy: used gossiper for map ip to host id in connection_dropped callback
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-15 16:30:29 +02:00
Gleb Natapov
ae821ba07a repair: use gossiper to map ip to host id instead of token_metadata
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-15 16:30:29 +02:00
Gleb Natapov
8c85350d4b db/virtual_tables: use host id from the gossiper endpoint state in cluster_status table
The state always has host id now, so there is no point to looks it up in
the token metadata.
2025-01-15 16:30:28 +02:00
Gleb Natapov
844cb090bf view: do not use get_endpoint_for_host_id_if_known to check if a node is part of the topology
Check directly in the topology instead.
2025-01-15 16:30:28 +02:00
Gleb Natapov
f685c7d0af hints: use gossiper to map ip to id in wait_for_sync_point
We want to drop ips from token_metadata so move to different API to map
ip to id.
2025-01-15 16:30:28 +02:00
Gleb Natapov
4d7c05ad82 hints: move create_hint_sync_point function to host ids
One of its caller is in the RESTful API which gets ips from the user, so
we convert ips to ids inside the API handler using gossiper before
calling the function. We need to deprecate ip based API and move to host
id based.
2025-01-15 16:30:28 +02:00
Gleb Natapov
755ee9a2c5 api: do not use token_metadata to retrieve ip to id mapping in token_metadata RESTful endpoints
We want to drop ip knowledge from the token_metadata, so use gossiper to
retrieve the mapping instead.
2025-01-15 16:30:28 +02:00
Gleb Natapov
0d4d066fe3 hints: simplify can_send() function
Since there is gossiper::is_alive version that works on host_id now
there is no need to convert _ep_key to ip which simplifies the code a
lot.
2025-01-15 16:30:28 +02:00
Gleb Natapov
50ee962033 service: address_map: add lookup function that expects address to exist
We will add code that expects id to ip mapping to exist. If it does not
it is better to fail earlier during testing, so add a function that
calls internal error in case there is no mapping.
2025-01-15 16:30:28 +02:00
Paweł Zakrzewski
5b1da31595 audit: Add shares support to service level management
Introduces shares-based workload prioritization for service levels, allowing
fine-grained control over resource allocation between tenants. Key changes:

- Add shares option to service level configuration:
  - Valid range: 1-1000 shares
  - Default value: 1000 shares
  - Enterprise-only feature gated by WORKLOAD_PRIORITIZATION feature flag

- Extend CQL interface:
  - Add shares parameter to CREATE/ALTER SERVICE_LEVEL
  - Add shares column to system_distributed.service_levels
  - Add percentage calculation to LIST SERVICE_LEVELS
  - Add shares to DESCRIBE EFFECTIVE SERVICE_LEVEL output

- Add validation:
  - Enforce shares range (1-1000)
  - Validate enterprise feature flag
  - Handle unset/delete markers properly

- Update service level statements:
  - Add shares validation to CREATE/ALTER operations
  - Preserve shares through default value replacement
  - Add proper decomposition for shares values in result sets

This change enables operators to control relative resource allocation between
tenants using proportional share scheduling, while maintaining backward
compatibility with existing service level configurations.
2025-01-15 15:01:05 +01:00
Botond Dénes
25fbe488ef Merge 'view_builder: write status to tables before starting to build' from Michael Litvak
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes #20638

The PR contains few additional small fixes in separate commits related to the view build status table.

It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it timeouts.

For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.

backport to fix the bugs that affects previous versions and improve CI stability

Closes scylladb/scylladb#22307

* github.com:scylladb/scylladb:
  view_builder: hold semaphore during entire startup
  view_builder: pass view name by value to write_view_build_status
  view_builder: write status to tables before starting to build
2025-01-15 15:01:17 +02:00
Paweł Zakrzewski
28bd699c51 audit: Add service level support to CQL login process
This change integrates service level functionality into the CQL authentication and connection handling:

- Add scheduling_group_name to client_data to track service level assignments
- Extend SASL challenge interface to expose authenticated username
- Modify connection processing to support tenant switching:
  - Add switch_tenant() method to handle scheduling group changes
  - Add process_until_tenant_switch() to handle request processing boundaries
  - Implement no_tenant() default executor
  - Add execute_under_tenant_type for scheduling group management

- Update connection lifecycle to properly handle service level changes:
  - Initialize connections with default scheduling group
  - Support dynamic scheduling group updates when service levels change
  - Ensure proper cleanup of scheduling group assignments

The changes enable proper scheduling group assignment and management based on
authenticated users' service levels, while maintaining backward compatibility
for connections without service level assignments.
2025-01-15 11:10:36 +01:00
Paweł Zakrzewski
98f5e49ea8 audit: Add support to CQL statements
Integrates audit functionality into CQL statement processing to enable tracking of database operations. Key changes:

- Add audit_info and statement_category to all CQL statements
- Implement audit categories for different statement types:
  - DDL: Schema altering statements (CREATE/ALTER/DROP)
  - DML: Data manipulation (INSERT/UPDATE/DELETE/TRUNCATE/USE)
  - DCL: Access control (GRANT/REVOKE/CREATE ROLE)
  - QUERY: SELECT statements
  - ADMIN: Service level operations

- Add audit inspection points in query processing:
  - Before statement execution
  - After access checks
  - After statement completion
  - On execution failures

- Add password sanitization for role management statements
  - Mask plaintext passwords in audit logs
  - Handle both direct password parameters and options maps
  - Preserve query structure while hiding sensitive data

- Modify prepared statement lifecycle to carry audit context
  - Pass audit info during statement preparation
  - Track audit info through statement execution
  - Support batch statement auditing

This change enables comprehensive auditing of CQL operations while ensuring sensitive data is properly masked in audit logs.
2025-01-15 11:10:36 +01:00
Paweł Zakrzewski
1810e2e424 audit: Integrate audit subsystem into Scylla main process
Adds core integration of the audit subsystem into Scylla's main process flow. Changes include:

- Import audit subsystem header
- Initialize audit system during server startup using configuration and token metadata
- Start audit system after API server initialization with query processor and memory manager
- Add proper shutdown sequence for audit system using RAII pattern
- Add error handling for audit system initialization failures

The audit system is now properly integrated into Scylla's lifecycle, ensuring:
- Correct initialization order relative to other subsystems
- Proper resource cleanup during shutdown
- Graceful error handling for initialization failures
2025-01-15 11:10:36 +01:00
Paweł Zakrzewski
702e727e33 audit: Add documentation for the audit subsystem
Adds detailed documentation covering the new audit subsystem:

- Add new audit.md design document explaining:
  - Core concepts and design decisions
  - CQL extensions for audit management
  - Implementation details and trigger evaluation
  - Prior art references from other databases

- Add user-facing documentation:
  - New auditing.rst guide with configuration and usage details
  - Integration with security documentation index
  - Updates to cluster management procedures
  - Updates to security checklist

The documentation covers all aspects of the audit system including:
- Configuration options and storage backends (syslog/table)
- Audit categories (DCL/DDL/AUTH/DML/QUERY/ADMIN)
- Permission model and security considerations
- Failure handling and logging
- Example configurations and output formats

This ensures users have complete guidance for setting up and using
the new audit capabilities.
2025-01-15 11:10:35 +01:00
Paweł Zakrzewski
384641194a audit: Add the audit subsystem
This change introduces a new audit subsystem that allows tracking and logging of database operations for security and compliance purposes. Key features include:

- Configurable audit logging to either syslog or a dedicated system table (audit.audit_log)
- Selective auditing based on:
  - Operation categories (QUERY, DML, DDL, DCL, AUTH, ADMIN)
  - Specific keyspaces
  - Specific tables
- New configuration options:
  - audit: Controls audit destination (none/syslog/table)
  - audit_categories: Comma-separated list of operation categories to audit
  - audit_tables: Specific tables to audit
  - audit_keyspaces: Specific keyspaces to audit
  - audit_unix_socket_path: Path for syslog socket
  - audit_syslog_write_buffer_size: Buffer size for syslog writes

The audit logs capture details including:
- Operation timestamp
- Node and client IP addresses
- Operation category and query
- Username
- Success/failure status
- Affected keyspace and table names
2025-01-15 11:10:35 +01:00
Piotr Dulikowski
72f28ce81e Merge 'main, view: Pair view builder drain with its start' from Dawid Mędrek
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are the second attempt
to the problem where we want to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.

Fixes scylladb/scylladb#20772
Fixes scylladb/scylladb#21534

Closes scylladb/scylladb#21909

* github.com:scylladb/scylladb:
  test/topology_custom: Add test for Scylla with disabled view building
  main, view: Pair view builder drain with its start
  Revert "main,cql_test_env: start group0_service before view_builder"
2025-01-15 09:50:26 +01:00
Sergey Zolotukhin
228a66d030 raft: refactor remove_from_raft_config to use a timed modify_config call.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.
2025-01-15 09:49:17 +01:00
Sergey Zolotukhin
3da4848810 raft: Refactor functions using modify_config to use a common wrapper
for retrying.

There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.
2025-01-15 09:49:17 +01:00
Sergey Zolotukhin
8c48f7ad62 raft: Handle non-critical config update errors in when changing status
to voter.

When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814
2025-01-15 09:49:15 +01:00
Takuya ASADA
f2a53d6a2c dist: make p11-kit-trust.so able to work in relocatable package
Currently, our relocatable package doesn't contains p11-kit-trust.so
since it dynamically loaded, not showing on "ldd" results
(Relocatable packaging script finds dependent libraries by "ldd").
So we need to add it on create-relocatable-pacakge.py.

Also, we have two more problems:
1. p11 module load path is defined as "/usr/lib64/pkcs11", not
   referencing to /opt/scylladb/libreloc
   (and also RedHat variants uses different path than Debian variants)

2. ca-trust-source path is configured on build time (on Fedora),
   it compatible with RedHat variants but not compatible with Debian
   variants

To solve these problems, we need to override default p11-kit
configuration.
To do so, we need to add an configuration file to
/opt/scylladb/share/pkcs11/modules/p11-kit-trust.module.
Also, ofcause p11-kit doesn't reference /opt/scylladb by default, we
need to override load path by p11_kit_override_system_files().

On the configuration file, we can specify module load path by "modules: <path>",
and also we can specify ca-trust-source path by "x-init-reservied: paths=<path>".

Fixes scylladb/scylladb#13904

Closes scylladb/scylladb#22302
2025-01-15 10:09:17 +02:00
Kefu Chai
0d399702c7 api: include used header
when building the tree on fedora 41, we could have following build
failure:

```
FAILED: api/CMakeFiles/api.dir/Debug/system.cc.o
/usr/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/abseil -I/usr/include/p11-kit-1 -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-explicit-specialization-storage-class -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=gnu++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -DWITH_GZFILEOP -MD -MT api/CMakeFiles/api.dir/Debug/system.cc.o -MF api/CMakeFiles/api.dir/Debug/system.cc.o.d -o api/CMakeFiles/api.dir/Debug/system.cc.o -c /home/kefu/dev/scylladb/api/system.cc
/home/kefu/dev/scylladb/api/system.cc:116:47: error: no member named 'lexical_cast' in namespace 'boost'
  116 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                        ~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:116:78: error: expected '(' for function-style cast or type construction
  116 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                                            ~~~~~~~~~~~~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:118:25: error: no type named 'bad_lexical_cast' in namespace 'boost'
  118 |         } catch (boost::bad_lexical_cast& e) {
      |                  ~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:136:47: error: no member named 'lexical_cast' in namespace 'boost'
  136 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                        ~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:136:78: error: expected '(' for function-style cast or type construction
  136 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                                            ~~~~~~~~~~~~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:140:25: error: no type named 'bad_lexical_cast' in namespace 'boost'
  140 |         } catch (boost::bad_lexical_cast& e) {
      |                  ~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:148:47: error: no member named 'lexical_cast' in namespace 'boost'
  148 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                        ~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:148:78: error: expected '(' for function-style cast or type construction
  148 |             logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
      |                                                            ~~~~~~~~~~~~~~~~~~^
/home/kefu/dev/scylladb/api/system.cc:150:25: error: no type named 'bad_lexical_cast' in namespace 'boost'
  150 |         } catch (boost::bad_lexical_cast& e) {
      |                  ~~~~~~~^
```

in this change, we include the used header to address the build failure.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22303
2025-01-15 10:11:40 +03:00
Jenkins Promoter
f5f15c6d07 Update pgo profiles - aarch64 2025-01-15 04:49:45 +02:00
Jenkins Promoter
f021e16d0c Update pgo profiles - x86_64 2025-01-15 04:26:42 +02:00
Sergey Zolotukhin
16053a86f0 test: Add test to check that a node does not fail on unknown commit status
error when starting up.

Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.

Test for scylladb/scylladb#20814
2025-01-14 17:12:06 +01:00
Sergey Zolotukhin
775411ac56 raft: Add run_op_with_retry in raft_group0.
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.
2025-01-14 17:12:04 +01:00
Kamil Braun
2eac7a2d61 Merge 'test/pylib: two trivial cleanups' from Kefu Chai
- use "foo not in bar" instead of "not foo in bar"
- test/pylib: use foo instead of `'{}'.format(foo)`

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#22066

* github.com:scylladb/scylladb:
  test/pylib: use `foo` instead of `'{}'.format(foo)`
  test/pylib: use "foo not in bar" instead of "not foo in bar"
2025-01-14 16:27:44 +01:00
Nadav Har'El
15c252fd8f Merge 'docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD' from Dawid Mędrek
As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH that prevented hashing a password when creating a role, effectively leading to inserting a hash given by the user directly into the database. In #21350, we noticed that Cassandra had implemented a CQL statement of similar semantics but different syntax. We decided to rename Scylla's statement to be compatible with Cassandra. Unfortunately, we didn't notice one more difference between what we had in Scylla and what was part of Cassandra.

Scylla's statement was originally supposed to only be used when restoring the schema and the user needn't have to be aware of its existence at all: the database produced a sequence of CQL statements that the user saved to a file and when a need to restore the schema arose, they would execute the contents of the file. That's why that although we documented the feature, it was only done in the necessary places. Those that weren't related to the backup & restore procedure were deliberately skipped.

Cassandra, on the other hand, added the statement for a different purpose (for details, see the relevant issue) and it was supposed to be used by the user by design. The statement is also documented as such.

Since we want to preserve compatibility with Cassandra, we document the statement and its semantics in the user documentation, explicitly implying that it can be used by the user.

We also add a test verifying that logging in works correctly.

Fixes scylladb/scylladb#21691

Backport: not needed. The relevant code didn't make it to 6.2 or any previous version of OSS.

Closes scylladb/scylladb#21752

* github.com:scylladb/scylladb:
  docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD
  test/boost: Add test for creating roles with hashed passwords
2025-01-14 15:33:30 +02:00
Kefu Chai
3b7a991f74 ent/encryption: rename "sie" to "get_opt"
"sie" is the short for "system info encryption". it is a wrapper around
a `opts` map so we can get the individual option by providing a default
value via an `optional<>` return value. but "sie" could be difficult to
understand without more context. and it is used like a function -- we
get the individual option using its operator().

so, in order to improve the readability, in this change, we rename it
to "get_opt".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 21:08:17 +08:00
Kefu Chai
92c6c8a32f ent,main: fix misspellings
these misspellings are identified by codespell. they are either in
comment or logging messages. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 21:08:17 +08:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Kefu Chai
e6b05cb9ea .github: use the toolchain specified by tools/toolchain/image
Previously, we hardwire the container to a previous frozen toolchain
image. but at the time of writing, the tree does not compile in the
specified toolchain image anymore, after the required building
environment is updated, and toolchain was updated accordingly.

in order to improve the maintability, let's reuse `read-toolchain.yaml`
job which reads `tools/toolchain/image`, so we don't have to hardwire
the container used for building the tree with the latest seastar. this
should address the build failure surfaced recently.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22287
2025-01-14 07:56:38 -05:00
Kefu Chai
f8885a4afd dist/docker,docs: replace "--experimental" with "--experimental-features"
The "--experimental" option was removed in commit f6cca741ea. Using this
deprecated option now causes Scylla to fail with the error:

```
error: the argument ('on') for option '--experimental-features' is invalid
```

So, in this change, let's update the docker entry point script to use
`--experimental-features` command line option instead. The related
document is updated accordingly.

Fixes scylladb/scylladb#22207
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22283
2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk
592512fd0f test: fix memtable_flush_period test
memtable_flush_period test sets the flush period to 200ms and checks
whether the data is flushed after 500ms.

When flush period is set, the timer is armed with the given value.
On expiration, memtables are flushed and then the timer is rearmed.

There is no certainty that during 500ms the flush finishes, though.

Check if after 500ms flush has started. Wait until there is an sstable.

Fixes: #21965.

Closes scylladb/scylladb#22162
2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk
32ab58cdea repair: add repair_service gate
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.

Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.

Fixes: #21964.

Closes scylladb/scylladb#22145
2025-01-14 07:56:38 -05:00
Geoff Montee
25e8478051 docs: rest.rst: use latest docker tag to view Swagger UI for REST API
Closes scylladb/scylladb#21681
2025-01-14 07:56:38 -05:00
Botond Dénes
686a997c04 Merge 'Complete implementation of configuring IO bandwidth limits' from Pavel Emelyanov
In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool.

This set implements the missing bits and adds tests for new functionality.

Closes scylladb/scylladb#21877

* github.com:scylladb/scylladb:
  nodetool: Implement [gs]etstreamthroughput commands
  nodetool: Implement [gs]etcompationthroughput commands
  test: Add validation of how IO-updating endpoints work
  api: Implement /storage_service/(stream|compaction)_throughput endpoints
  api: Disqualify const config reference
  api: Implement /storage_service/stream_throughput endpoint
  api: Move stream throughput set/get endpoints from storage service block
  api: Move set_compaction_throughput_mb_per_sec to config block
  util: Include fmt/ranges.h in config_file.hh
2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk
94f4871352 test: start waiting for task before it gets aborted
Ensure that the repair task was aborted after wait API acknowledged
its existence.

Fixes: #22011.

Closes scylladb/scylladb#22012
2025-01-14 07:56:37 -05:00
Michael Litvak
7a6aec1a6c view_builder: hold semaphore during entire startup
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.
2025-01-14 12:31:29 +02:00
Michael Litvak
1104411f83 view_builder: pass view name by value to write_view_build_status
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.

The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.
2025-01-14 12:31:29 +02:00
Michael Litvak
b1be2d3c41 view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes scylladb/scylladb#20638
2025-01-14 12:31:20 +02:00
Asias He
cd96fb5a78 repair: Add repair_hosts_filter and repair_dcs_filter
They will be useful for hosts and DCs selection for the repair
scheduler. It is not implemented yet. Adding it earlier, so we do not
need to change the system tabler later.

Closes scylladb/scylladb#21985
2025-01-14 08:46:26 +02:00
Geoff Montee
c8ca2bd212 docs: operating-scylla/admin-tools/virtual-tables.rst: fix link to virtual tables
Closes scylladb/scylladb#22198
2025-01-14 08:45:49 +02:00
Lakshmi Narayanan Sreethar
63100b34da sstable_directory: do not load remote sstables in process_descriptor
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-13 20:01:30 +05:30
Lakshmi Narayanan Sreethar
6e3ecc70a6 sstable_directory: update load_sstable() definition
Updated `sstable_directory::load_sstable()` to directly accept
`data_dictionary::storage_options` instead of a function that returns
the same. This is required to ensure `process_descriptor()` loads the
sstable only once in the right shard.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-13 20:00:29 +05:30
Nadav Har'El
321d0fd3b1 Merge 'Alternator: Add WCU suppport for update item' from Amnon Heiman
This series adds WCU support for the Alternator update item.
This motivation behind it, is to have a rough estimation of what a similar operation would have taken from WCU perspective if used with DynamoDB.

The calculation is done while minimal overhead is the prime objective, the results are values that is less or equal to what it would have been in DynamoDB

** New feature, no need to backport. **

Closes scylladb/scylladb#21999

* github.com:scylladb/scylladb:
  alternator/test_returnconsumedcapacity.py: update item
  alternator/executor.cc: Add WCU for update_item
2025-01-13 14:35:46 +02:00
Kamil Braun
48a4efba2f Merge 'Fix possible data corruption due to token keys clashing in read repair.' from Sergey Zolotukhin
This update addresses an issue in the mutation diff calculation
algorithm used during read repair. Previously, the algorithm used
`token` as the hashmap key. Since `token` is calculated basing on the
Murmur3 hash function, it could generate duplicate values for different
partition keys, causing corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2

Closes scylladb/scylladb#21996

* github.com:scylladb/scylladb:
  storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
  storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
  test: Add test case for checking read repair diff calculation when having conflicting keys.
2025-01-13 10:54:34 +01:00
Kamil Braun
88a48f2355 Merge 'Load peers table into the gossiper on boot' from Gleb
Since we manage ip to id mapping directly in gossiper now we need to
load the mapping on boot. We already do it anyway, but only due to a bug
which checks raft topology mode config before it is set, so the code
thinks that it is in the gossiper mode and loads peers table into the
gossiper and token metadata. Fix the bug and load peers into the gossiper
only since token metadata is managed by raft.

The series also removes address map related test that no longer checks
anything and replace it with unit test.

It also adds the dc/rack check to "join node" rpc. The check is done
during shadow round now, but for it to work it requires dc/rack to be
propagated through the gossiper and we want to eventually drop it.

Ref: scylladb/scylladb#21777

* 'load-peers' of https://github.com/gleb-cloudius/scylla:
  topology coordinator: reject replace request if topology does not match
  gossiper: fix the logic of shadow_round parameter
  storage_service: do not add endpoint to the gossiper during topology loading.
  storage_service: load peers into gossiper on boot in raft topology mode
  storage_service: set raft topology change mode before using it in join_cluster
  locator: drop inet_address usage to figure out per dc/rack replication
  test: drop test_old_ip_notification_repro.py
  test: address_map: check generation handling during entry addition
2025-01-13 09:40:36 +01:00
Andrei Chekun
2aea2610e0 test.py: Wait for tasks finish before going further
Developers using asyncio.gather() often assume that it waits for all futures (awaitables) givens.
But this isn't true when the return_exceptions parameter is False, which is the default.
In that case, as soon as one future completes with an exception, the gather() call will return this exception immediately, and some of the finished tasks may continue to run in the background.
This is bad for applications that use gather() to ensure that a list of background tasks has all completed.
So such applications must use asyncio.gather() with return_exceptions=True, to wait for all given futures to complete either successfully or unsuccessfully.

Closes scylladb/scylladb#22252
2025-01-13 09:43:28 +02:00
Botond Dénes
f899f0e411 tools/scylla-sstable: dump-statistics: fix handling of {min,max}_column_names
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.

While at it, the type of these fields is also fixed in the
documentation.

Before:

    "min_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "�2y\u0000�}\u007f"
    ],
    "max_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "}��B\u0019l%^"
    ],

After:

    "min_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "80327900e2827d7f"
    ],
    "max_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "7df79242196c255e"
    ],

Fixes: #22078

Closes scylladb/scylladb#22225
2025-01-13 09:19:04 +03:00
Botond Dénes
a21ecc3253 tools/scylla-sstable: also try reading scylla.yaml from /etc/scylla
scylla-sstable tries to read scylla.yaml via the following sequence:
1) Use user-provided location is provided (--scylla-yaml-file parameter)
2) Use the environment variables SCYLLA_HOME and/or SCYLLA_CONF if set
3) Use the default location ./conf/scylla.yaml

Step 3 is fine on dev machines, where the binaries are usually invoked
from scylla.git, which does have conf/scylla.yaml, but it doesn't work
on production machines, where the default location for scylla.yaml is
/etc/scylla/scylla.yaml. To reduce friction when used on production
machines, add another fallback in case (3) fails, which tries to read
scylla.yaml from /etc/scylla/scylla.yaml location.

Fixes: scylladb/scylladb#22202

Closes scylladb/scylladb#22241
2025-01-13 09:11:29 +03:00
Kefu Chai
752e6561fb test/pylib: log if scylla exits with non-zero status code
When destroying a test cluster, ScyllaCluster.stop() calls ScyllaServer.stop()
for each running server. Previously, non-zero exit status codes from scylla
servers were silently ignored during test teardown.

This change modifies the logging behavior to print the exit status code when
a scylla server exits with a non-zero status. This helps developers quickly
identify potential issues or unexpected terminations during test runs.

Differences in handling:
- Before: Non-zero exit codes were not logged
- After: Non-zero exit codes are printed, providing visibility into
  server termination errors

This improvement aids in diagnosing intermittent test failures or
unexpected server shutdowns during test execution.

Refs #21742

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21934
2025-01-13 09:09:43 +03:00
Kefu Chai
41de3a17e1 api: move histogram data into future to avoid deep copying
Previously, we created a vector<utils_json::histogram> and returned it
by copying into a future. Since histogram is a JSON representation of
ihistogram, it can be heavyweight, making the vector copy overhead
significant.

Now we move the vector into the returned future instead of copying it,
eliminating the deep copy overhead. The APIs backed by this function
are marked deprecated, so this performance improvement is not that
important.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22004
2025-01-13 09:08:15 +03:00
Kefu Chai
fbca0a08f7 build: cmake: do not add absl::headers as a link directory
In 0b0e661a85, we brought abseil back as a submodule, and we
added absl::headers as an interface library for importing
abseil headers' include directory. And:

```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla
/home/kefu/dev/scylla/idl/absl::headers
```

In this change, we remove `absl::headers` from
`target_link_directories()` as it's an interface library that only
provides header files, not linkable libraries. This fixes the incorrect
inclusion of absl::headers in the rpath of the scylla executable.

Additionally, remove abseil library dependencies from the idl target
since none of the idl source files directly include abseil headers.

After this change,

```console
$ patchelf --print-rpath build/RelWithDebInfo/scylla

```

the output of `pathelf` is now empty.

Fixes #22265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22266
2025-01-13 09:05:43 +03:00
Kefu Chai
d815d7013c sstables_loader: report progress with the unit of batch
We restore a snapshot of table by streaming the sstables of
the given snapshot of the table using
`sstable_streamer::stream_sstable_mutations()` in batches. This function
reads mutations from a set of sstables, and streams them to the target
nodes. Due to the limit of this function, we are not able to track the
progress in bytes.

Previously, progress tracking used individual sstables as units, which caused
inaccuracies with tablet-distributed tables, where:
- An sstable spanning multiple tablets could be counted multiple times
- Progress reporting could become misleading (e.g., showing "40" progress
  for a table with 10 sstables)

This change introduces a more robust progress tracking method:
- Use "batch" as the unit of progress instead of individual sstables.
  Each batch represents a tablet when restoring a table snapshot if
  the tablet being restored is distributed with tablets. When it comes
  to tables distributed with vnode, each batch represents an sstable.
- Stream sstables for each tablet separately, handling both partially and
  fully contained sstables
- Calculate progress based on the total number of sstables being streamed
- Skip tablet IDs with no owned tokens

For vnode-distributed tables, the number of "batches" directly corresponds
to the number of sstables, ensuring:
- Consistent progress reporting across different table distribution models
- Simplified implementation
- Accurate representation of restore progress

The new approach provides a more reliable and uniform method of tracking
restoration progress across different table distribution strategies.

Also, Corrected the use of `_sstables.size()` in
`sstable_streamer::stream_sstables()`. It addressed a review comment
from Pavel that was inadvertently overlooked during previous rebasing
the commit of 5ab4932f34.

Fixes scylladb/scylladb#21816

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21841
2025-01-13 09:04:35 +03:00
Dawid Mędrek
d1f960eee2 test/topology_custom: Add test for Scylla with disabled view building
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.
2025-01-13 00:41:27 +01:00
Dawid Mędrek
06ce976370 main, view: Pair view builder drain with its start
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt to the problem where we want to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Fixes scylladb/scylladb#20772
2025-01-13 00:41:22 +01:00
Dawid Mędrek
a5715086a4 Revert "main,cql_test_env: start group0_service before view_builder"
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:

```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```

we started getting the following failure:

```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```

We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.

This reverts commit 7bad8378c7.
2025-01-12 18:13:56 +01:00
Avi Kivity
814942505f Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund
Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631

EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data.
Introduces OpenSSL based file level encrypted storage, managed via a set of providers
ranging from local files to cloud KMS providers.

For a more comprehensive explanation, see the included docs (or if possible, original
source tree).

Manual bulk merge of EAR feature from enterprise repo to main scylla repo.

Breaks some features apart, but main EAR is still a humongous commit, because to separate this
I would have to mess with code incrementally, adding time and risk.

This PR includes the local file gen tool, tests and also p11 validation.

Note: CI will not execute the full tests unless master CI is set to provide the same environment
as the enterprise one. Not sure about the status of this ATM.

Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to
check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality
will be enabled, otherwise not.

Closes scylladb/scylladb#22233

* github.com:scylladb/scylladb:
  docs: Add EAR docs
  main/build: Add p11-kit and initialize
  tools: Add local-file-key-generator tool
  tests: Add EAR tests
  tmpdir: shorten test tempdir path
  EAR: port the ear feature from enterprise
  cql_test_env: Add optional query timeout
  schema/migration_manager: Add schema validate
  sstables: add get_shared_components accessor
  config/config_file: Add exports and definitions of config_type_for<>
2025-01-12 16:10:46 +02:00
Kefu Chai
e71ac35426 mutation_writer,redis: do not include unused headers
the changes porting enterprise features to oss brought some
used include to the tree. so let's remove them. these unused
includes were identified by clang-include-cleaner.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22246
2025-01-12 16:07:17 +02:00
Yaron Kaikov
6f30d26f2a Update tools/cqlsh submodule
* tools/cqlsh b09bc793...52c61306 (3):
  > cleanup: remove un-used Dockerfiles
  > .github/workflows/build-push.yml: update to newer macos images
  > cython: fix the usage of cython

Closes scylladb/scylladb#22250
2025-01-12 16:06:30 +02:00
Jenkins Promoter
a7d8d21e86 Update pgo profiles - aarch64 2025-01-12 15:27:50 +02:00
Jenkins Promoter
b4ca9489c4 Update pgo profiles - x86_64 2025-01-12 15:05:40 +02:00
Benny Halevy
8d2ff8a915 utils: add disk_space_monitor
Instantiated only on shard 0.
Currently, only subscribe from unit test

Manual unit test using loop mount was added.
Note that the test requires sudo access
and root access to /dev/loop, so it cannot
run in rootless podman instance, and it'd
fail with Permission denied.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21523
2025-01-12 14:51:15 +02:00
Piotr Smaron
288f9b2b15 Introduce LDAP role manager & saslauthd authenticator
This PR extends authentication with 2 mechanisms:
- a new role_manager subclass, which allows managing users via
LDAP server,
- a new authenticator, which delegates plaintext authentication
to a running saslauthd daemon.

The features have been ported from the enterprise repository
with their test.py tests and the documentation as part of
changing license to source available.

Fixes: scylladb/scylla-enterprise#5000
Fixes: scylladb/scylla-enterprise#5001

Closes scylladb/scylladb#22030
2025-01-12 14:50:29 +02:00
Nadav Har'El
31c6a33666 Merge 'error_injection: replace boost::lexical_cast with std::from_chars' from Avi Kivity
Replace boost with a standard facility; this reduces dependencies as lexical_cast depends on boost ranges.

Since std::from_chars() is chatty, we introduce utils::from_chars_exactly() to trade some flexibility for conciseness.

Small build time improvement, no backport needed.

Closes scylladb/scylladb#22164

* github.com:scylladb/scylladb:
  error_injection: replace boost::lexical_cast with std::from_chars
  utils: introduce from_chars_exactly()
2025-01-12 14:38:54 +02:00
Lakshmi Narayanan Sreethar
d2ba45a01f sstable_directory: reintroduce get_shards_for_this_sstable()
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-10 23:32:58 +05:30
Aleksandra Martyniuk
1d46bdb1ad test: boost: check resize_task_info in tablet_test.cc 2025-01-10 16:04:19 +01:00
Aleksandra Martyniuk
b11c21e901 test: add tests to check revoked resize virtual tasks
The test is skipped in debug mode, because the preparation of revoke
takes too long and wait request, which needs to be started before
the preparation, hits timeout.
2025-01-10 16:04:11 +01:00
Aleksandra Martyniuk
50c9c0d898 test: add tests to check the list of resize virtual tasks 2025-01-10 10:03:09 +01:00
Aleksandra Martyniuk
2ed4bad752 test: add tests to check spilt and merge virtual tasks status 2025-01-10 10:03:09 +01:00
Aleksandra Martyniuk
48e0843767 test: test_tablet_tasks: generalize functions 2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
062f155fd6 replica: service: add split virtual task's children
offstrategy_compaction_task_executor and split_compaction_task_executor
running as a part of the split become children of a split virtual task.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
7ef6900837 replica: service: pass parent info down to storage_group::split
Pass task_info down to storage_group::split.

In the following patches, it will be used to set the parent
of offstrategy_compaction_task_executor and split_compaction_task_executor
running as a part of the split. The task_info param will contain task
info of a split virtual task.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
14dcaecc29 tasks: children of virtual tasks aren't internal by default
Currently, streaming_task_impl is the only existing child of any
virtual task.  It overrides the is_internal definition so that it
is non-internal even though it has a parent.

This should apply to all children of all virtual tasks. Modify
task_manager::task::impl::is_internal so that children of virtual
tasks aren't internal by default.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
5a948d3fac tasks: initialize shard in task_info ctor
Initialize shard in task_info constructor. All current usages do
not care about the shard of an empty task_info.

In the following patches we may need that for setting info about
virtual task parent.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
840bcdc158 service: extend tablet_virtual_task::abort
Set resize tasks as non abortable.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
639470d256 service: retrun status_helper struct from tablet_virtual_task::get_status_helper 2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
0c7bef6875 service: extend tablet_virtual_task::wait
Extend tablet_virtual_task::wait to support resize tasks.

To decide what is a state of a finished resize virtual task (done
or failed), the tablet count is checked. The task state is set to done,
if the tablet count before resize is different than after.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
24bbd161fd tasks: add suspended task state
Add suspended task state. It will be used for revoke resize requests.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
adf6b3f3ff service: extend tablet_virtual_task::get_status
Extend tablet_virtual_task::get_status to cover resize tasks.
2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk
78215d64d1 service: extend tablet_virtual_task::contains
Extend tablet_virtual_task::contains to check resize operations.

Methods that do not support resize tasks return immediately if they
are handling split or merge task.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
0df64e18fb service: extend tablet_virtual_task::get_stats
Extend tablet_virtual_task::get_stats to list resize tasks.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
a8d7f4d89a service: add service::task_manager_module::get_nodes 2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
3f6b932362 tasks: add task_manager::get_nodes
Move an implementation of node_ops::task_manager_module::get_nodes
to task_manager::get_nodes, so that it can be reused by other modules.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
5dfac9290c tasks: drop noexcept from module::get_nodes 2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
18b829add8 replica: service: add resize_task_info static column to system.tablets
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
b6b4b767de locator: extend tablet_task_info to cover resize tasks 2025-01-10 10:03:07 +01:00
Michael Litvak
2a8ff478f0 view_builder: register listener for new views before reading views
When starting the view builder, we find all existing views in
`calculate_shard_build_step` and then register a listener for new views.
Between these steps we may yield and create a new view, then we miss
initializing the view build step for the new view, and we won't start
building it.

To fix this we first register the listener and then read existing views,
so a view can't be missed.

Fixes scylladb/scylladb#20338

Closes scylladb/scylladb#22184
2025-01-09 13:18:28 +02:00
Calle Wilund
8e828f608d docs: Add EAR docs
Merge docs relating to EAR.
2025-01-09 10:40:47 +00:00
Calle Wilund
083f735366 main/build: Add p11-kit and initialize
For p11 certification/validation
2025-01-09 10:40:47 +00:00
Calle Wilund
f901beec87 tools: Add local-file-key-generator tool
For generating key files for local provider
2025-01-09 10:40:47 +00:00
Calle Wilund
c596ae6eb1 tests: Add EAR tests
Adds the migrated EAR/encryption tests.
Note: Until scylla CI is updated to provide all the proper
ENV vars, some tests will not execute.
2025-01-09 10:40:39 +00:00
Calle Wilund
ee62b61c84 tmpdir: shorten test tempdir path
To make certain python tests work in CI
2025-01-09 10:37:35 +00:00
Calle Wilund
723518c390 EAR: port the ear feature from enterprise
Bulk transfer of EAR functionality. Includes all providers etc.
Could maybe break up into smaller blocks, but once it gets down to
the core of it, would require messing with code instead of just moving.
So this is it.

Note: KMIP support is disabled unless you happen to have the kmipc
SDK in your scylla dir.

Adds optional encryption of sstables and commitlog, using block
level file encryption. Provides key sourcing from various sources,
such as local files or popular KMS systems.
2025-01-09 10:37:26 +00:00
Avi Kivity
9ff6473691 error_injection: replace boost::lexical_cast with std::from_chars
Replace boost with a standard facility; this reduces dependencies
as lexical_cast depends on boost ranges.

As a side effect the exception error message is improved.
2025-01-09 11:14:51 +02:00
Avi Kivity
224dc34089 utils: introduce from_chars_exactly()
This is a replacement for boost::lexical_cast (but without its
long dependency chain). It wraps std::from_chars(), providing a less
flexible but also more concise interface.
2025-01-09 11:14:49 +02:00
Michał Chojnowski
1728f9c983 utils/dict_trainer: silence an ERROR log when raft is aborted during dict publication
The dict publication routine might throw raft::request_aborted when the node is
aborted. This doesn't deserve an ERROR log. Let's demote the log printed in this
case from ERROR to DEBUG.

Fixes scylladb/scylladb#22081

Closes scylladb/scylladb#22211
2025-01-08 17:55:05 +01:00
Kefu Chai
462a10c4f6 test.py: do not repeat "combined_tests"
instead of repeating "combined_tests", let's define a variable for
it. less repeating this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22185
2025-01-08 15:43:34 +02:00
Sergey Zolotukhin
2f1731c551 test: Include parent test name in ScyllaClusterManager log file names.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.

Fixes scylladb/scylladb#21807

Backport: not needed since this is a fix in the testing scripts.

Closes scylladb/scylladb#22192
2025-01-08 15:42:31 +02:00
Calle Wilund
e734fc11ec cql_test_env: Add optional query timeout
Some tests need queries to actually fail.
2025-01-08 12:50:03 +00:00
Calle Wilund
511326882a schema/migration_manager: Add schema validate
Validates schema before announce. To ensure all extensions
are happy.
2025-01-08 12:50:03 +00:00
Calle Wilund
9f06a0e3a3 sstables: add get_shared_components accessor
To access the shared components.
2025-01-08 12:50:03 +00:00
Calle Wilund
7ed89266b3 config/config_file: Add exports and definitions of config_type_for<>
Required for implementors. Other than config.cc.
2025-01-08 12:50:03 +00:00
Kefu Chai
d0a3311ced locator: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22199
2025-01-08 14:26:48 +02:00
Kefu Chai
866520ff89 test.py: Defer Scylla executable check until test execution
Move the Scylla executable existence check from PythonTestSuite's constructor
to test execution time. This allows running unit tests that don't depend on
the scylla executable without building it first.

Previously, PythonTestSuite's constructor would fail if the Scylla executable
was missing, preventing even unrelated unit tests from running. Now, only
tests that actually require Scylla will fail if the executable is
missing.

Fixes scylladb/scylladb#22168
Refs scylladb/scylladb#19486

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22224
2025-01-08 14:25:50 +02:00
Michael Litvak
35316a40c8 service/storage_proxy: consider all replicas participating in write for MV backpressure
replica writes are delayed according to the view update backlog in order
to apply backpressure and reduce the rate of incoming base writes when
the backlog is large, allowing slow replicas to catch up.

previously the backlog calculation considered only the pending targets,
excluding targets that replied successfuly, probably due to confusion in
the code. instead, we want to consider the backlog of all the targets
participating in the write.

Fixes scylladb/scylladb#21672

Closes scylladb/scylladb#21935
2025-01-08 12:03:26 +01:00
Botond Dénes
a2436f139f docs/dev: review-checklist.md: expand the guide for good commit log
Closes scylladb/scylladb#22214
2025-01-08 13:01:35 +02:00
Kefu Chai
f41b030fdd repair: do not include unused header
this unused include was identifier by clang-include-cleaner. after
auditing task_manager_module.hh, the report has been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22200
2025-01-08 12:58:35 +02:00
Avi Kivity
de8253b98a types: explicitly instantiate map_type_impl::deserialize()
The definition of the template is in a source translation unit, but there
are also uses outside the translation unit. Without lto/pgo it worked due
to the definition in the translation unit, but with lto/pgo we can presume
the definition was inlined, so callers outside the translation unit did not
have anything to link with.

Fix by explicitly instantiating the template function.

Closes scylladb/scylladb#22136
2025-01-08 11:52:11 +02:00
Benny Halevy
e6efaa3b73 Update seastar submodule
* seastar 3133ecdd...a9bef537 (24):
  > file: add file_system_space
  > future: avoid inheriting from future payload type
  > treewide: include fmt/ostream.h for using fmt::print()
  > build: remove messages used for debugging
  > demos: Rename websocket demo to websocket_server demo
  > demos: Add a way to set port from cmd line in websocket demo
  > tls: Add optional builder + future-wait to cert reload callback + expose rebuild
  > rwlock: add try_hold_{read,write}_lock methods
  > json: add moving push to json_list
  > github: add a step to build "check-include-style"
  > build: add a target for checking include style
  > scheduling_group: use map for key configs instead of vector
  > scheduling_group: fix indentation
  > scheduling_group: fix race between scheduling group and key creation
  > http: Make request writing functions public
  > http: Expose connection_factory implementations
  > metrics: Use separate type for shared metadata
  > file: unexpected throw from inside noexcept
  > metrics: Internalize metric label sets
  > thread: optimize maybe_yield
  > reactor: fix crash in pending registration task after poller dtor
  > net: Fix ipv6 socket_address comparision
  > reactor, linux-aio: factor out get_smp_count() lambda
  > reactor, linux-aio: restore "available_aio" meaning after "reserve_iocbs"

Fixed usage of seastar metric label sets due to:
scylladb/seastar@733420d57 Merge 'metrics: Internalize metric label sets' from Stephan Dollberg

Closes scylladb/scylladb#22076
2025-01-08 09:37:16 +02:00
Kefu Chai
23729beeb5 docs: remove "ScyllaDB Enterprise" labels
remove the "ScyllaDB Enterprise" labels in document. because
there is no need to differentiate ScyllaDB Enterprise from its OSS
variant, let's stop adding the "ScyllaDB Enterprise" labels to
enterprise-only features. this helps to reduce the confusion.

as we are still in the process of porting the enterprise features
to this repo, this change does not fix scylladb/scylladb#22175.
we will review the document again when completing the migration.

we also take this opportunity to stop referencing "Enterprise" in
the changed paragraph.

Refs scylladb/scylladb#22175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22177
2025-01-08 09:02:52 +02:00
Kefu Chai
e51b2075da docs/kb: correct referenced git sha1 and version number
in 047ce136, we cherry-picked the change adding
garbage-collection-ics.rst to the document. but it was still
referencing the git sha1 and version number in enterprise.

this change updates kb/garbage-collection-ics.rst, so that it

* references the git commit sha1 in this repo
* do not reference the version introducing this feature, as
  per Anna Stuchlik

  > As a rule, we should avoid documenting when something was
  > introduced or set as a default because our documentation
  > was versioned. Per-version information should be listed in
  > the release notes.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22195
2025-01-08 07:08:15 +02:00
Michał Chojnowski
9f639b176f db/config: increase the default value of internode_compression_zstd_min_message_size from 0 to 1024
Usually, the smaller the messsage, the higher the CPU cost per each network byte
saved by compression, so it often makes sense to reserve heavier compression
for bigger messages (where it can make the biggest impact for a given CPU budget)
and use ligher compression for smaller messages.

There is a knob -- internode_compression_zstd_min_message_size -- which
excludes RPC messages below certain size from being compressed with zstd.
We arbitrarily set its default to 0 bytes before.
Now we want to arbitrarily set it to 1024 bytes.

This is based purely on intuition and isn't backed by any solid data.

Fixes scylladb/scylla-enterprise#4731

Closes scylladb/scylla-enterprise#4990

Closes scylladb/scylladb#22204
2025-01-07 18:14:01 +02:00
Wojciech Mitros
d04f376227 mv: add an experimental feature for creating views using tablets
We still have a number of issues to be solved for views with tablets.
Until they are fixed, we should prevent users from creating them,
and use the vnode-based views instead.

This patch prepares the feature for enabling views with tablets. The
feature is disabled by default, but currently it has no effect.
After all tests are adjusted to use the feature, we should depend
on the feature for deciding whether we can create materialized views
in tablet-enabled keyspaces.

The unit tests are adjusted to enable this feature explicitly, and it's
also added to the scylla sstable tool config - this tool treats all
tables as if they were tablet-based (surprisingly, with SimpleStrategy),
so for it to work on views, the new feature must be enabled.

Refs scylladb/scylladb#21832

Closes scylladb/scylladb#21833
2025-01-07 15:52:36 +01:00
Emil Maskovsky
115005d863 raft: refactor the voters api to allow enabling voters
The raft voters api implementation only allowed to make a node to be
a non-voter, but for the "limited voters" feature we need to
also have the option to make the node a voter (from within the topology
coordinator).

Modifying the api to allow both adding and removing voters.

This in particular tries to simplify the API by not having to add
another set of new functions to make a voter, but having a single setter
that allows to modify the node configuration to either become a voter or
a non-voter.

Fixes: scylladb/scylladb#21914

Refs: scylladb/scylladb#18793

Closes scylladb/scylladb#21899
2025-01-07 15:25:50 +01:00
Asias He
d719f423e5 config: Enable enable_small_table_optimization_for_rbno by default
Since the problematic dtests are with the
enable_small_table_optimization_for_rbno turn off now, we can enable the
flag by default.

https://github.com/scylladb/scylla-dtest/pull/5383

Refs: #19131

Closes scylladb/scylladb#21861
2025-01-07 16:20:36 +02:00
Asias He
935dcd69fa repair: Remove repair_task_info only when repair is finished
In case of error, repair will be moved into the end_repair stage. We
should not remove repair_task_info in this case because the repair task
requested by the user is not finished yet.

To fix, we should remove repair_task_info at the end of repair stage.

Tests are added to ensure failed repair is not reported as finished.

Closes scylladb/scylladb#21973
2025-01-07 16:19:40 +02:00
Avi Kivity
748d30a34d tools: toolchain: simplify non-emulated build procedure
Avoid using temporary names and instead treat the final image tag
as a temporary.

The new procedure is more or less

   remote-final := local-x86_64
   local-aarch64 += remote-final
   remote-final := local-aarch64 (which now contains the x86_64 image too)

Closes scylladb/scylladb#21981
2025-01-07 16:17:29 +02:00
Asias He
baaee28c07 storage_service: Add tablet migration log
So that both mutation and file streaming will have the same log for
tablet streaming which simplifies the dtest checking.

Closes scylladb/scylladb#22176
2025-01-07 15:16:37 +01:00
Emil Maskovsky
2ac9ed2073 raft: test the limited voters feature
Test the limited voters feature by creating a cluster with 3 DCs, one of
them disproportionately larger than the others. The raft majority should
not be lost in case the large DC goes down.

Fixes: scylladb/scylla#21915

Refs: scylladb/scylla#18793

Closes scylladb/scylladb#21901
2025-01-07 15:09:49 +01:00
Michael Litvak
0617564123 db/commitlog: make the commit log hard limit mandatory
mark the config parameter --commitlog-use-hard-size-limit as deprecated so the
default 'true' is always used, making the hard limit mandatory.

Fixes scylladb/scylladb#16471

Closes scylladb/scylladb#21804
2025-01-07 15:03:56 +02:00
Anna Stuchlik
8d824a564f doc: add troubleshooting removal with --autoremove-ubuntu
This commit adds a troubleshooting article on removing ScyllaDB
with the --autoremove option.

Fixes https://github.com/scylladb/scylladb/issues/21408

Closes scylladb/scylladb#21697
2025-01-07 13:35:13 +01:00
Botond Dénes
b3f8c4faa7 Merge 'node_ops: filter topology_requests entries shown by node_ops_virtual_task' from Aleksandra Martyniuk
node_ops_virtual_task does not filter the entries of system.topology_request
and so it creates statuses of operations that aren't node ops.

Filter the entries used by node_ops_virtual_task. With this change, the status
of a bootstrap of the first node will not be visible.

Fixes: https://github.com/scylladb/scylladb/issues/22008.

Needs backport to 6.2 that introduced node_ops_virtual_task

Closes scylladb/scylladb#22009

* github.com:scylladb/scylladb:
  test: truncate the table before node ops task checks
  node_ops: rename a method that get node ops entries
  node_ops: filter topology_requests entries
2025-01-07 14:17:01 +02:00
Dani Tweig
d984f27b23 Create urgent_issue_reminder.yml
Closes scylladb/scylladb#22042
2025-01-07 14:16:17 +02:00
Kefu Chai
353b522ca0 treewide: migrate from boost::adaptors::reversed to std::views::reverse
now that we are allowed to use C++23. we now have the luxury of using
`std::views::reverse`.

- replace `boost::adaptors::transformed` with `std::views::transform`
- remove unused `#include <boost/range/adaptor/reversed.hpp>`

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-07 13:22:00 +02:00
Kefu Chai
f7fd55146d compaction: do not include unused headers
these unused includes are identified by clang-include-cleaner.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22188
2025-01-07 13:18:31 +02:00
Yaron Kaikov
b74565e83f dist/common/scripts/scylla_raid_setup: reduce XFS metadata overhead
The block size of 1k is significantly increasing metadata overhead with xfs since it reserves space upfront for btree expansion. With CRC disabled, this reservation doesn't happen. Smaller btree blocks reduce the fanout factor, increasing btree height and the reservation size. So block size implies a trade-off between write amplification and metadata size. Bigger blocks, smaller metadata, more write ampl. Smaller blocks, more metadata, and less write ampl.

Let's disable both `rmapbt` and `relink` since we replicate data, and we can afford to rebuild a replica on local corruption.

Fixes: https://github.com/scylladb/scylladb/issues/22028

Closes scylladb/scylladb#22072
2025-01-07 13:18:21 +02:00
Botond Dénes
69150f0680 Merge 'Fix edge case issues related to tablet draining ' from Tomasz Grabiec
Main problem:

If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.

The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.

Fixes #21826

Secondary problem:

We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state.

Third problem:

We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN.

The overload part in not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes.

Closes scylladb/scylladb#21928

* github.com:scylladb/scylladb:
  tablets: load_balancer: Fail when draining with no candidate nodes
  tablets: load_balancer: Ignore skip_list when draining
  tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
2025-01-07 13:04:00 +02:00
Botond Dénes
173fad296a tools/schema_loader.cc: remove duplicate include of short_streams.hh
Closes scylladb/scylladb#21982
2025-01-07 13:03:17 +02:00
David Garcia
66a5e7f672 docs: update Sphinx configuration for unified repository publishing
This change is related to the unification of enterprise and open-source repositories.

The Sphinx configuration is updated to build documentation either for `docs.scylladb.com/manual` or `opensource.docs.scylladb.com`, depending on the flag passed to Sphinx.

By default, it will build docs for `docs.scylladb.com/manual`. If the `opensource` flag is passed, it will build docs for `opensource.docs.scylladb.com`, with a different set of versions.

This change will prepare the configuration to publish to `docs.scylladb.com/manual` while allowing the option to keep publishing and editing docs with a different multiversion configuration.

Note that this change will continue publishing docs to `opensource.docs.scylladb.com` for now since the `opensource` flag is being passed in the `gh-pages.yml` branch.

chore: remove comment

chore: update project name

Closes scylladb/scylladb#22089
2025-01-07 12:54:51 +02:00
Kefu Chai
e4463b11af treewide: replace boost::algorithm::join() with fmt::join()
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.

When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.

To avoid the ambiguity, some caller sites were updated to call
`seastar::format()` explicitly.

See also

- boost::algorithm::join():
  https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
  https://fmt.dev/11.0/api/#ranges-api

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22082
2025-01-07 12:45:05 +02:00
Aleksandra Martyniuk
a91e03710a repair: check tasks local to given shard
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161
2025-01-06 21:53:54 +02:00
Kefu Chai
d3f3e2a6c8 .github: add more subdirectories to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's include

- mutation_writer
- node_ops
- redis
- replica

subdirectories to CLEANER_DIR, so that this workflow can identify the
regressions in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22050
2025-01-06 21:28:39 +02:00
Avi Kivity
5653d13d48 Merge 'Clean up test/alternator mistakes that service levels introduced' from Nadav Har'El
The recent pull request https://github.com/scylladb/scylladb/pull/22031 introduced some regressions into the test/alternator framework. For a long time now, tests can create their own CQL roles for testing role-based features. But the new service levels test changed the "run" script and test.py's "suite.yaml" to create a new role and service level just for one test. This is not only ugly (the test code is now split to two places) and unnecessary, this setup also means that you can't run this test against an already-running copy of Scylla which wasn't prepared with the "right" role and service level. Even worse - the code that was added test/alternator/run was plain wrong - it used an outdated keyspace name (the code in suite.yaml was fine).

So in this patch I remove that extra run and suite.yaml code, and replace it by code inside the service level test to create the role and service level that it wants to test rather than assume it already exists.
While at it, I also removed a lot of duplicate and unnecessary code from this test.

After this patch, test/alternator/run returns to work correctly, after #22031 broke it.

This patch fixes a recent testing-framework regression, so doesn't need to be backported (unless that regression is backported).

Fixes #22047.

Closes scylladb/scylladb#22172

* github.com:scylladb/scylladb:
  test/alternator: fix mistakes introduced with test_service_levels.py
  test/alternator: move "cql" fixture to test/alternator/conftest.py
2025-01-06 17:44:25 +02:00
Anna Stuchlik
047ce13641 doc: add a new KB article about tombstone garbage collection in ICS
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22174
2025-01-06 16:48:50 +02:00
Kefu Chai
8873a4e1aa test.py: pass "count" to re.sub() with kwarg
since Python 3.13, passing count to `re.sub()` as positional argument
has been deprecated. and when runnint `test.py` with Python 3.13, we
have following warning:

```
/home/kefu/dev/scylladb/./test.py:1540: DeprecationWarning: 'count' is passed as positional argument
  args.modes = re.sub(r'.* List configured modes\n(.*)\n', r'\1',
```

see also https://github.com/python/cpython/issues/56166

in order to silence this distracting warning, let's pass
`count` using kwarg.

this change was created in the same spirit of c3be4a36af.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22085
2025-01-06 16:35:38 +02:00
Avi Kivity
4632e217e3 cql3: grammar: simplify unaliasedSelector production
The return variable s only gets a value by assignment from the
temporary tmp. Make tmp the return value instead.

Closes scylladb/scylladb#22151
2025-01-06 13:06:12 +02:00
Kefu Chai
9396c2ee6c api: include "smaller" header
Previously, `api/service_levels.hh` includes `api/api.hh` for
accessing symbols like `api/http_context`. but these symbols are
already available in a "smaller" header -- `api/api_init.hh`. so,
in order to improve the build efficiency, let's include smaller
headers in favor of "larger" ones.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22178
2025-01-06 13:04:33 +02:00
Amnon Heiman
7390116620 alternator/test_returnconsumedcapacity.py: update item
This patch adds tests for return consumed capacity for update_item.

The tests cover: a simple update for a small object, a missing item, an
update with a very large attribute (where the attribute itself is more
than 1KB), and an update of a big item that uses read-before-write.
2025-01-06 09:55:17 +02:00
Nadav Har'El
fc22d5214f Merge 'test.py: check for existence of combined test with correct path' from Kefu Chai
test.py: Only check existence of Scylla executable

Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
  handle build paths and artifacts

This change:
1. Moves executable existence check to PythonTestSuite class
2. Only adds combined_test suite when the executable exists
3. Eliminates redundant os.access() checks
4. Corrects the path to combined_test when checking for its existence

This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.

Fixes scylladb/scylladb#22086

---

no need to backport, because the offending commit (8b7a5ca88d) is not included by any LTS branches yet.

Closes scylladb/scylladb#22163

* github.com:scylladb/scylladb:
  test.py: Fix path checking for combined_test executable
  test.py: Throw only if scylla executable is not found
2025-01-06 09:21:01 +02:00
Nadav Har'El
e919794db8 test/alternator: fix mistakes introduced with test_service_levels.py
This patch undoes multiple mistakes done when introducing the test
for service levels in pull request #22031:

1. The PR introduced in test/alternator/run and test/alternator/suite.yaml
   a permanent role and service level that the service-level test is
   supposed to use. This was a mistake - the test can create the service
   level for its own use, using CQL, it does not need to assume such a
   service level already exists.
   It's important to fix this to allow the service level test to run
   against an installation of Scylla not set up by our own scripts.
   Moreover, while the code in suite.yaml was correct, the code in
   "run" was incorrect (used an outdated keyspace name). This patch
   removes that incorrect code.

2. The PR introduced a duplicate "cql" fixture, copied verbatim
   from test_cql_rbac.py (including a comment that was correct only
   in the latter file :-)). Let's de-duplicate it, using the fixture
   that I moved to conftest.py in the previous patch.

3. The PR used temporary_grant(). This needelessly complicated the test
   and added even more duplicate code, and this patch removes all that
   stuff. This test is about service levels, not RBAC and "grant".
   This test should just use a superuser role that has the permissions
   to do everything, and don't need to be granted specific permissions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-01-05 19:40:14 +02:00
Nadav Har'El
879c0a3bd6 test/alternator: move "cql" fixture to test/alternator/conftest.py
Most Alternator test use only the DynamoDB API, not CQL. Tests in
test_cql_rbac.py did need CQL to set up roles and RBAC, so this file
introduced a "cql" fixture to make CQL requests.

A recently-introduced test/alternator/test_service_levels.py also
needs access to CQL - it currently uses it for misguided reasons but
the next patch will need it for creating a role and a service level.
So instead of duplicating this fixture, let's move this fixture into
test/alternator/conftest.py that all Alternator tests can share.

The next patch will clean up this duplication in test_service_levels.py
and the other mistakes it introduced.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-01-05 19:33:55 +02:00
Kefu Chai
569f8e9246 treewide: fix misspellings
these misspellings were identified by codespell. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22154
2025-01-05 16:13:09 +02:00
Raphael S. Carvalho
c973254362 Introduce incremental compaction strategy (ICS)
ICS is a compaction strategy that inherits size tiered properties --
therefore it's write optimized too -- but fixes its space overhead of
100% due to input files being only released on completion. That's
achieved with the concept of sstable run (similar in concept to LCS
levels) which breaks a large sstable into fixed-size chunks (1G by
default), known as run fragments. ICS picks similar-sized runs
for compaction, and fragments of those runs can be released
incrementally as they're compacted, reducing the space overhead
to about (number_of_input_runs * 1G). This allows user to increase
storage density of nodes (from 50% to ~80%), reducing the cost of
ownership.

NOTE: test_system_schema_version_is_stable adjusted to account for batchlog
using IncrementalCompactionStrategy

contains:

compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh)
compaction/CMakeLists.txt: include ICS cc files
configure.py: changes for ICS files, includes test
db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported
db/system_keyspace: pick ICS for some system tables
schema/schema.hh: ICS becomes default
test/boost: Add incremental_compaction_test.cc
test/boost/sstable_compaction_test.cc: ICS related changes
test/cqlpy/test_compaction_strategy_validation.py: ICS related changes

docs/architecture/compaction/compaction-strategies.rst: changes to ICS section
docs/cql/compaction.rst: changes to ICS section
docs/cql/ddl.rst: adds reference to ICS options
docs/getting-started/system-requirements.rst: updates sentence mentioning ICS
docs/kb/compaction.rst: changes to ICS section
docs/kb/garbage-collection-ics.rst: add file
docs/kb/index.rst: add reference to <garbage-collection-ics>
docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section

some relevant commits throughout the ICS history:

commit 434b97699b39c570d0d849d372bf64f418e5c692
Merge: 105586f747 30250749b8
Author: Paweł Dziepak <pdziepak@scylladb.com>
Date:   Tue Mar 12 12:14:23 2019 +0000

    Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael

    "
    Introduce new compaction strategy which is essentially like size tiered
    but will work with the existing incremental compaction. Thus incremental
    compaction strategy.

    It works like size tiered, but each element composing a tier is a sstable
    run, meaning that the compaction strategy will look for N similar-sized
    sstable runs to compact, not just individual sstables.

    Parameters:
    * "sstable_size_in_mb": defines the maximum sstable (fragment) size
    composing
    a sstable run, which impacts directly the disk space requirement which is
    improved with incremental compaction.
    The lower the value the lower the space requirement for compaction because
    fragments involved will be released more frequently.
    * all others available in size tiered compaction strategy

    HOWTO
    =====

    To change an existing table to use it, do:
         ALTER TABLE mykeyspace.mytable  WITH compaction =
    {'class' : 'IncrementalCompactionStrategy'};

    Set fragment size:
         ALTER TABLE mykeyspace.mytable  WITH compaction =
    {'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 }

    "

commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1
Merge: dca89ce7a5 e08ef3e1a3
Author: Avi Kivity <avi@scylladb.com>
Date:   Tue Sep 8 11:31:52 2020 +0300

    Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael

    "
    A new option, space_amplification_goal (SAG), is being added to ICS. This option
    will allow ICS user to set a goal on the space amplification (SA). It's not
    supposed to be an upper bound on the space amplification, but rather, a goal.
    This new option will be disabled by default as it doesn't benefit write-only
    (no overwrites) workloads and could hurt severely the write performance.
    The strategy is free to delay triggering this new behavior, in order to
    increase overall compaction efficiency.

    The graph below shows how this feature works in practice for different values
    of space_amplification_goal:
    https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png

    When strategy finds space amplification crossed space_amplification_goal, it
    will work on reducing the SA by doing a cross-tier compaction on the two
    largest tiers. This feature works only on the two largest tiers, because taking
    into account others, could hurt the compaction efficiency which is based on
    the fact that the more similar-sized sstables are compacted together the higher
    the compaction efficiency will be.

    With SAG enabled, min_threshold only plays an important role on the smallest
    tiers, given that the second-largest tier could be compacted into the largest
    tier for a space_amplification_goal value < 2.
    By making the options space_amplification_goal and min_threshold independent,
    user will be able to tune write amplification and space amplification, based on
    the needs. The lower the space_amplification_goal the higher the write
    amplification, but by increasing the min threshold, the write amplification
    can be decreased to a desired amount.
    "

commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1
Author: Raphael S. Carvalho <raphaelsc@scylladb.com>
Date:   Sat Oct 16 13:41:46 2021 -0300

    compaction: ICS: Add garbage collection

    Today, ICS lacks an approach to persist expired tombstones in a timely manner,
    which is a problem because accumulation of tombstones are known to affecting
    latency considerably.

    For an expired tombstone to be purged, it has to reach the top of the LSM tree
    and hope that older overlapping data wasn't introduced at the bottom.
    The condition are there and must be satisfied to avoid data resurrection.

    STCS, today, has an inefficient garbage collection approach because it only
    picks a single sstable, which satisfies the tombstone density threshold and
    file staleness. That's a problem because overlapping data either on same tier
    or smaller tiers will prevent tombstones from being purged. Also, nothing is
    done to push the tombstones to the top of the tree, for the conditions to be
    eventually satisfied.

    Due to incremental compaction, ICS can more easily have an effecient GC by
    doing cross-tier compaction of relevant tiers.

    The trigger will be file staleness and tombstone density, which threshold
    values can be configured by tombstone_compaction_interval and
    tombstone_threshold, respectively.

    If ICS finds a tier which meets both conditions, then that tier and the
    larger[1] *and* closest-in-size[2] tier will be compacted together.
    [1]: A larger tier is picked because we want tombstones to eventually reach the
    top of the tree.
    [2]: It also has to be the closest-in-size tier as the smaller the size
    difference the higher the efficiency of the compaction. We want to minimize
    write amplification as much as possible.
    The staleness condition is there to prevent the same file from being picked
    over and over again in a short interval.

    With this approach, ICS will be continuously working to purge garbage while
    not hurting overall efficiency on a steady state, as same-tier compactions are
    prioritized.

    Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
    Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22063
2025-01-04 15:43:52 +02:00
Kefu Chai
220cafe7c4 test.py: Fix path checking for combined_test executable
Previously in 8b7a5ca88d, we checked for combined_test existence
without the "build" component in the path. This caused the test
suite to never find the executable, preventing the test cases'
cache from being populated.

Changes:

1. Use path_to() to check executable existence, which:
   - Includes the "build" component in path
   - Handles both CMake and configure.py build paths
2. Move existence check out of _generate_cache() for clarity

This ensures combined_test and its included tests are properly
discovered and run.

Fixes scylladb/scylladb#22086

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-04 06:11:21 +08:00
Kefu Chai
9d0f27e7c1 test.py: Throw only if scylla executable is not found
Previously, we had inconsistent behavior around missing executables:
- 561e88f0 added early failure if any executable was missing
- 8b7a5ca8 added a partial skip for combined_test, but didn't properly
  handle build paths and artifacts

This change:
1. Moves executable existence check to PythonTestSuite class
3. Eliminates redundant os.access() checks

This allows running tests with a partial build while properly handling
missing executables, particularly for the combined_test suite.

In a succeeding change, we will correct the check for combined_tests.

Refs scylladb/scylladb#19489
Refs scylladb/scylladb#22086

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-04 06:11:21 +08:00
Tomasz Grabiec
4c89e62470 Merge 'Phased barrier improvements' from Benny Halevy
- utils: phased_barrier: advance_and_await: allocate new gate only when needed
- utils: phased_barrier: add close() method
  - and use in existing services

* Improvement. No backport needed

Closes scylladb/scylladb#22018

* github.com:scylladb/scylladb:
  utils: phased_barrier: add close() method
  utils: phased_barrier: advance_and_await: allocate new gate only when needed
2025-01-03 18:51:23 +01:00
Sergey Zolotukhin
155480595f storage_proxy/read_repair: Remove redundant 'schema' parameter from data_read_resolver::resolve
function.

The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.
2025-01-03 10:04:13 +01:00
Sergey Zolotukhin
39785c6f4e storage_proxy/read_repair: Use partition_key instead of token key for mutation
diff calculation hashmap.

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.

Fixes scylladb/scylladb#19101
2025-01-03 09:53:02 +01:00
Sergey Zolotukhin
e577f1d141 test: Add test case for checking read repair diff calculation when having
conflicting keys.

The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.
2025-01-03 09:53:02 +01:00
Avi Kivity
202f16e799 Merge 'Introduce workload prioritization for service levels' from Piotr Dulikowski
This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of the user which has this service level attached (if multiple are attached then the one with the lowest shares wins).

Different service levels will be isolated in the following way:

- Each service level gets its own scheduling group with the number of shares (corresponding to the service level's number of shares), which controls the priority of the CPU and I/O used for user operations running on that service level.
- Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates.
- Each service level gets its own TCP connections for RPC to prevent priority inversion issues.

Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user created service levels + 1 created by default that cannot be removed.

This feature has been previously only available in ScyllaDB Enterprise but has been made available for the source available ScyllaDB. The series was created by comparing the master branch with source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made such as fixing whitespace, removing unused or unnecessary code, adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made.

No backport is required.

Closes scylladb/scylladb#22031

* github.com:scylladb/scylladb:
  tracing: record scheduling group in trace event record
  qos: un-shared-from-this standard_service_level_distributed_data_accessor
  alternator: execute under scheduling group for service level
  test.py: support multiple commands in prepare_cql in suite.yml
  docs: add documentation for workload prioritization
  docs/dev: describe workload prioritization features in service_levels
  test/auth_cluster: test workload prioritization in service level tests
  cqlpy/test_service_levels: add workload prioritization tests
  api: introduce service levels specific API
  api/cql_server_test: add information about scheduling group
  db/virtual_tables: add scheduling group column to system.clients
  test/boost: update service_level_controller_test for workload prio
  qos: include number of shares in DESCRIBE
  cql3/statements: update SL statements for workload prioritization
  transport/server: use scheduling group assigned to current user
  messaging_service: use separate set of connections per service levels
  replica/database: add reader concurrency semaphore groups
  qos: manage and assign scheduling groups to service levels
  qos: use the shares field in service level reads/writes
  qos: add shares to service_level_options
  qos: explicitly specify columns when querying service level tables
  db/system_distributed_keyspace: add shares column and upgrade code
  db/system_keyspace: adjust SL schema for workload prioritization
  gms: introduce WORKLOAD_PRIORITIZATION cluster feature
  build: increase the max number of scheduling groups
  qos: return correct error code when SL does not exist
2025-01-02 20:05:36 +02:00
Kefu Chai
0ea8cd2bb8 test/pylib/minio_server: use error level for fatal errors
Previously fatal errors like missing Minio executable were logged at INFO level,
which could be filtered out by log settings. Switch to ERROR level to ensure
these critical issues are always visible to developers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22084
2025-01-02 20:03:55 +02:00
Gleb Natapov
245483f1bc topology coordinator: reject replace request if topology does not match
Currently it should not happen because gossiper shadow round does
similar check, but we want to drop states that propagate through raft
from the gossiper eventually.
2025-01-02 18:44:19 +02:00
Gleb Natapov
7e3a196734 gossiper: fix the logic of shadow_round parameter
Currently the logic is mirrored shadow_round is true in on shadow round.
Fix it but flipping all the logic.
2025-01-02 18:44:19 +02:00
Gleb Natapov
2736a3e152 storage_service: do not add endpoint to the gossiper during topology loading.
As removed comment says it was done because storage_service::join_cluster
did not load gossiper endpoint but now it does.
2025-01-02 18:44:19 +02:00
Gleb Natapov
4fee8e0e09 storage_service: load peers into gossiper on boot in raft topology mode
Gossiper manages address map now, so load peers table into the gossiper
on reboot to be able to map ids to ips as early as possible.
2025-01-02 18:44:19 +02:00
Gleb Natapov
acbc667d3e storage_service: set raft topology change mode before using it in join_cluster
ss::join_cluster calls raft_topology_change_enabled() before the mode is
initialized below in the same function. Fix it by changing the order.
2025-01-02 18:44:19 +02:00
Gleb Natapov
491b7232de locator: drop inet_address usage to figure out per dc/rack replication
It allows to correctly calculate replication map even without knowing
IPs of the nodes.
2025-01-02 18:44:19 +02:00
Botond Dénes
7d42b80228 service/storage_proxy: data_read_resolver::resolve(): remove unneded maybe_yield()
We already have a yield in the loop via apply_gently(), the maybe_yield
is superfluous so remove it.

Follow-up to https://github.com/scylladb/scylladb/pull/21884

Closes scylladb/scylladb#21984
2025-01-02 16:13:29 +01:00
Kefu Chai
de42dce4c4 pgo: use java-11 when running cassandra-stress
we updated tools/java/build.xml recently to only build for java-11. so
if

- the `java` executable in `$PATH` points to a java which is neither
  java-8 nor java-11.
- java-8 is installed

java-8 is used to execute the cassandra-stress tool. and we would have
following failure:

```
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recogniz
es class file versions up to 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621)
```

in order to be compatible with the bytecode targeting java-11, let's run
cassandra-stress with java-11. we do not need to support java-8, because
the new tools/java is now building cassandra-stress targeting java-11 jre.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22142
2025-01-02 16:56:29 +02:00
Artsiom Mishuta
174199610b test.py: add more log info if the server is broken
attribute server_broken_reason into the server was introduced, to store the raw information
regarding why the server was broken

additional information was added in the error messages in case of "server
broken"

fixes: #21630

Closes scylladb/scylladb#22074
2025-01-02 16:54:55 +02:00
Kefu Chai
233e3969c4 utils: correct misspellings
these misspellings were identified by codespell. let's fix them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22143
2025-01-02 16:47:57 +02:00
Avi Kivity
1ce373d80b schema: deinline some speculative_retry methods
This string conversion functions are not in any fast path. Deinlining
them moves a <boost/lexical_cast.hpp> include out of a common header file.

Some files accessed on boost::iterator_range via lexical_cast.hpp,
so they gain a new dependency.

Closes scylladb/scylladb#21950
2025-01-02 12:28:33 +01:00
Avi Kivity
051c310f02 tracing: record scheduling group in trace event record
We have a "thread" field (unfortunately not yet displayed
in cqlsh, but visible in the table) that records the shard
on which a particular event was recorded. Record the scheduling
group as well, as this can be useful to understand where the
query came from.

(cherry picked from commit 3c03b5f66376dca230868e54148ad1c6a1ad0ee2)
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
07fdf9d21f qos: un-shared-from-this standard_service_level_distributed_data_accessor
Apparently, it is not needed for
standard_service_level_distributed_data_accessor to derive from
enable_shared_from_this.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
b23bc3a5d5 alternator: execute under scheduling group for service level
Now, the Alternator API requests are executed under the correct
scheduling group of the service level assigned to the currently logged
in user.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
67b11e846a test.py: support multiple commands in prepare_cql in suite.yml
This will be needed for alternator tests introduced in the next commit,
which will have to execute multiple CQL operations during preparation.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
07b162fb5b docs: add documentation for workload prioritization
The doc pages were slightly adjusted during migration not to mention
Scylla Enterprise and to fix some whitespace issues.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
241e710c19 docs/dev: describe workload prioritization features in service_levels
The concept of shares, and some helper HTTP APIs, are now described in
the developer documentation for service levels.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
473bb44722 test/auth_cluster: test workload prioritization in service level tests
Update `test_connections_parameters_auto_update` to also check that the
scheduling group of given connections is appropriately changed when a
different service level is assigned to the user that the connection uses
for authentication.

Apart from that, more tests are added:

- Check for the logic that forbids setting shares for a service level
  until all nodes in the cluster are upgraded
- Test for handling the case when there are more scheduling groups than
  it is allowed (it might happen after upgrade from a non-workload-prio
  version)
- Regression test for a bug where less scheduling groups could have been
  created than allowed due to some metrics not being renamed on
  scheduling group name change.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
29b153c9e7 cqlpy/test_service_levels: add workload prioritization tests
Adjust existing cqlpy tests and add more in order to test the workload
prioritization feature:

- The DESCRIBE test is updated to check that generated statements
  contain information about shares
- Two tests for shares in the LIST EFFECTIVE SERVICE LEVEL statement
- Regression test which checks that we can create as many service levels
  as promised in the documentation (currently 7), but no more
- Test which checks that NULL shares in the service levels table are
  treated as the default 1000 shares
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
49f5fc0e70 api: introduce service levels specific API
Introduces two endpoints with operations specific to service levels:

- switch_tenants: updates the scheduling group of all connections to be
  aligned with the service level specific to the logged in user. This is
  mostly legacy API, as with service levels on raft this is done
  automatically.
- count_connections: for each user and for each scheduling group, counts
  how many connections are assigned to that user and scheduling group.
  This API is used in tests.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
a65c0c3735 api/cql_server_test: add information about scheduling group
Now, information about connections' scheduling group is included in the
HTTP API for querying information about connections' parameters.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
9319d65971 db/virtual_tables: add scheduling group column to system.clients
Add the "scheduling_group" column to the system.clients table which
names the scheduling group that currently serves the connection/client.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
bbc655ff32 test/boost: update service_level_controller_test for workload prio
Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ce4032dfc0 qos: include number of shares in DESCRIBE
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
0f62eb45d1 cql3/statements: update SL statements for workload prioritization
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to the service levels.

Adjust the CQL statements for service levels:

- CREATE/ALTER now allow to set shares (only if the cluster is fully
  upgraded)
- LIST EFFECTIVE SERVICE LEVEL now return the number of shares in a new
  column
- LIST SERVICE LEVEL(S) also return the number of shares, and has the
  additional column "percentage of all service level shares"
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
6d90a933cd transport/server: use scheduling group assigned to current user
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged in user.
The scheduling group is also updated when the service level assigned to
this user changes.

Starting from this commit, the scheduling groups managed by the service
level controller are actually being used by user workload.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
f1b9737e07 messaging_service: use separate set of connections per service levels
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.

The above is achieved by reusing the existing concept of "tenants" in
messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.

In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
7383013f43 replica/database: add reader concurrency semaphore groups
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.

Each group is statically assigned to some pool of memory on startup and
dynamically distribute this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.

The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
4cfd26efaf qos: manage and assign scheduling groups to service levels
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.

The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during node's lifetime.

When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".

When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.

When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".

For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ff51551a94 qos: use the shares field in service level reads/writes
Now, the newly introduced `shares` field is used when service levels are
either read from or written into system tables.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
a6f681029f qos: add shares to service_level_options
Add service level shares related fields to service_level_options and
slo_effective_names structs, and adjust the existing methods of the
former (merge_with, init_effective_names) to account for them.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
2eb35f37d0 qos: explicitly specify columns when querying service level tables
The service levels table is queried with a `SELECT * ...` query, by
using the `execute_internal` method which prepares and caches the query
in an special cache for internal queries, separate from the user query
cache.

During rolling upgrade from a version which does not support service
level shares to the one that does, the `shares` column is added. The
aforementioned internal query cache is _not_ invalidated on schema
change, so the cache might still contain the prepared query from the
time before the column was added, and that prepared query will fetch the
old set of column without the new `shares` column.

In order to solve this, explicitly specify the columns in the query
string, using the full set of column names from the time when the query
is executed.

Note that this is a problem only for the legacy, non-raft service
levels. Raft-based service levels use a local table for which the schema
is determined on startup.

Also note that this code only fetches values from the `shares` column
but does not make any use of it otherwise. It will be handled by later
commits in this series.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ea25b29684 db/system_distributed_keyspace: add shares column and upgrade code
Add the "shares" column to the
system_distributed_keyspace.service_levels table, which is used by
legacy code.

Because this table is in a distributed and not local keyspace, adding
the column to an existing cluster during rolling upgrade requires a bit
of care. A callback is added to the workload prioritization cluster
feature which runs when the feature becomes enabled and adds the column
for all nodes in the cluster.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
346fc84c3e db/system_keyspace: adjust SL schema for workload prioritization
Add a "shares" column which hold the number of shares allocated to
given service level.

It is not used by the code at all right now, subsequent commits will
make good use of it.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
ecbf8721de gms: introduce WORKLOAD_PRIORITIZATION cluster feature
Information about the number of shares per service level will be stored
in an additional column in the service levels table, which is managed
through group0. We will need the feature to make sure that all nodes in
the cluster know about the new column before any node starts applying
group0 commands the would touch the new column.

This feature also serves a role for the legacy service levels
implementation that uses system_distributed for storage: after all nodes
are upgraded to support workload prioritization, one of the nodes will
perform a schema change operation and will add the new column.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
75d2d0d949 build: increase the max number of scheduling groups
Workload prioritization assigns scheduling groups to service levels, and
the number of scheduling groups that can exist at the same time is
limited with a compile-time parameter in seastar. The documentation for
workload prioritization says that we currently support 7 user-managed
service levels and 1 created by default. Increase the current
compile-time limit in order to align with the documentation.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
48e7ffc300 qos: return correct error code when SL does not exist
The `nonexistant_service_level_exception` can be thrown by service
levels code and propagated up to the CQL server layer, where it is
converted into a CQL protocol error. The aforementioned exception
inherits from `service_level_argument_exception`, which in turn inherits
from `std::invalid_argument` - which doesn't mean much to the CQL layer
and is converted to a generic SERVER_ERROR.

We can do better and return a more meaningful error code for this
exception. Change the base class of service_level_argument_exception to
exceptions::invalid_request_exception which gets converted to an INVALID
error.

The INVALID error code was already being used by the enterprise version,
so this commit just synchronizes error handling with enterprise.
2025-01-02 07:13:34 +01:00
Avi Kivity
727f68e0f5 Merge 'cql3: allow SELECT of specific collection element' from Michael Litvak
This adds to the grammar the option to SELECT a specific element in a collection (map/set/list).

For example:
`SELECT map['key'] FROM table`
`SELECT map['key1']['key2'] FROM table`

This feature was implemented in Cassandra 4.0 and was requested by scylla users.

The behavior is mostly compatible with Cassandra, except:
1. in SELECT, we allow list subscript in a selector, while cassandra allows only map and set.
2. in UPDATE, we allow set subscript in a column condition, while cassandra allows only map and list.
3. the slice syntax `SELECT m[a..b]` is not implemented yet
4. null subscript - `SELECT m[null]` returns null in scylla, while cassandra returns error

Fixes #7751

backport was requested for a user to be able to use it

Closes scylladb/scylladb#22051

* github.com:scylladb/scylladb:
  cql3: allow SELECT of specific collection key
  cql3: allow set subscript
2025-01-01 14:48:40 +02:00
Gleb Natapov
c4b26ba8dc test: drop test_old_ip_notification_repro.py
The test no longer test anything since the address map is updated much
earlier now by the gossiper itself, not by the notifiers. The
functionality is tested by a unit test now.
2025-01-01 12:43:11 +02:00
Gleb Natapov
c4db90799a test: address_map: check generation handling during entry addition
Check that adding an entry with smaller generation does not overwrite
existing entry.
2025-01-01 12:43:11 +02:00
Benny Halevy
85bd799308 storage_service: replicate_to_all_cores: prevent stalls when preparing per-table erms
Although the `network_topology_stratergy::make_replication_map` ->
`tablet_aware_replication_strategy::do_make_replication_map`
is not cpu intensive it still allocates and constructs a shared
`tablet_effective_replication_map`, and that might stall with
thousands of tablet-based tables.

Therefore coroutinize the preparation loop to allow yielding.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-31 14:52:39 +01:00
Gleb Natapov
745b6d7d0d gossiper: ignore gossiper entries with local host id in gossiper mode as well
We already ignore a gossiper entries with host id equal to local host id
in raft mode since those entries are just outdated entries since before
ip change. The same logic applies to gossiper mode as well though, so do
the same in both modes.

Fixes: scylladb/scylladb#21930

Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>
2024-12-31 15:50:12 +02:00
Avi Kivity
76cf5148e1 Merge 'message: introduce advanced rpc compression' from Michał Chojnowski
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:

After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.

A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.

All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.

This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.

Closes scylladb/scylladb#22032

* github.com:scylladb/scylladb:
  messaging_service: use advanced_rpc_compression::tracker for compression
  message/dictionary_service: introduce dictionary_service
  service: make Raft group 0 aware of system.dicts
  db/system_keyspace: add system.dicts
  utils: add advanced_rpc_compressor
  utils: add dict_trainer
  utils: introduce reservoir_sampling
  utils: introduce alien_worker
  utils: add stream_compressor
2024-12-31 15:02:57 +02:00
Evgeniy Naydanov
4260f3f55a test.py: topology_random_failures: log randomization parameters in test
Logging randomization parameters in the pytest_generate_tests hook doesn't
play well for us.  To make these parameters more visible move the logging
to the test level.

Closes scylladb/scylladb#22055
2024-12-31 14:23:47 +02:00
Avi Kivity
2b48c2e72a Merge 'build: add support for LTO and PGO to the building system' from Kefu Chai
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.

Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.

LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking

PGO Implementation:
- Implement three-stage build process:
  1. Context-free profiling (`-fprofile-generate`)
  2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
  3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently

Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script

Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.

---

this is a forward port, hence no need to backport.

Closes scylladb/scylladb#22039

* github.com:scylladb/scylladb:
  build: cmake: add CMake options for PGO support
  build: cmake: add "Scylla_ENABLE_LTO" option
  build: set LTO and PGO flags for Seastar in cmake build
  build: collect scylla libraries with `scylla_libs` variable
  build: Unify Abseil CXX flags configuration
  configure.py: prepare the build for a default PGO profile in version control
  configure.py: introduce profile-guided optimization
  pgo: add alternator workloads training
  pgo: add a repair workload
  pgo: add a counters workload
  pgo: add a secondary index workload
  pgo: add a LWT workload
  pgo: add a decommission workload
  pgo: add a clustering workload
  pgo: add a basic workload
  pgo: introduce a PGO training script
  configure.py: don't include non-default modes in dist-server-* rules
  configure.py: enable LTO in release builds by default
  configure.py: introduce link-time optimization
  configure.py: add a `default` to `add_tristate`.
  configure.py: unify build rules for cxxbridge .cc files and regular .cc files
2024-12-31 14:14:40 +02:00
Avi Kivity
4905b1bf76 Merge 'table: make update_effective_replication_map sync again' from Benny Halevy
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.

Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().

- storage_service: replicate_to_all_cores: update base and view tables atomically

Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)

To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.

* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2

Closes scylladb/scylladb#21781

* github.com:scylladb/scylladb:
  storage_service: replicate_to_all_cores: clear_gently pending erms
  test_mv_topology_change: drop delay_after_erm_update injection case
  storage_service: replicate_to_all_cores: update base and view tables atomically
  table: make update_effective_replication_map sync again
2024-12-30 23:42:06 +02:00
Tomasz Grabiec
bf3d0b3543 reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing
Observed 3% throughput improvement in sstable-heavy workload bounded by CPU.

SStable parsing involves lots of buffer operations which obtain and
destroy resource_units. Before the patch, reosurce_unit destruction
invoked maybe_admit_waiters(), which performs some computations on
waiting permits. We don't really need to admit on each change of
resources, since the CPU is used by other things anyway. We can batch
the computation. There is already a fiber which does this for
processing the _ready_list. We can reuse it for processing _wait_list
as well.

The changes violate an assumption made by tests that releasing
resources immediately triggers an admission check. Therefore, some of
the BOOST_REQUIRE_EQUAL needs to be replaced with REQUIRE_EVENTUALLY_EQUAL
as the admision check is now done in the fiber processing the _ready_list.

`perf-simple-query` --tablets --smp 1 -m 1G results obtained for
fixed 400MHz frequency:

Before:
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

112590.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17992 cycles/op,        0 errors)
122620.68 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41310 insns/op,   17713 cycles/op,        0 errors)
118169.48 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41353 insns/op,   17857 cycles/op,        0 errors)
120634.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41328 insns/op,   17733 cycles/op,        0 errors)
117317.18 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41347 insns/op,   17822 cycles/op,        0 errors)

         throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60
instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83
  cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00
```

After
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

123689.63 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17384 cycles/op,        0 errors)
129643.24 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40997 insns/op,   17325 cycles/op,        0 errors)
128907.27 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41009 insns/op,   17325 cycles/op,        0 errors)
130342.56 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40993 insns/op,   17286 cycles/op,        0 errors)
130294.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40972 insns/op,   17336 cycles/op,        0 errors)

         throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63
instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48
  cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33
```

Closes scylladb/scylladb#21918

[avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]
2024-12-30 23:37:46 +02:00
Michael Litvak
5ef7afb968 cql3: allow SELECT of specific collection key
This adds to the grammar the option to SELECT a specific key in a
collection column using subscript syntax.

For example:
SELECT map['key'] FROM table
SELECT map['key1']['key2'] FROM table

The key can also be parameterized in a prepared query. For this we need
to pass the query options to result_set_builder where we process the
selectors.

Fixes scylladb/scylladb#7751
2024-12-30 17:05:20 +02:00
Wojciech Mitros
74cbc77f50 test: add test for schema registry maintaining base info for views
In this patch we test the behavior of schema registry in a few
scenarios where it was identified it could misbehave.

The first one is reverse schemas for views. Previously, SELECT
queries with reverse order on views could fail because we didn't
have base info in the registry for such schemas.

The second one is schemas that temporarily died in the registry.
This can happen when, while processing a query for a given schema
version, all related schema_ptrs were destroyed, but this schema
was requested before schema_registry::grace_period() has passed.
In this scenario, the base info would not be recovered, causing
errors.
2024-12-30 14:59:06 +01:00
Wojciech Mitros
3094ff7cbe schema_registry: avoid setting base info when getting the schema from registry
After the previous patches, the view schemas returned by schema registry
always have their base info set. As such, we no longer need to set it after
getting the view schema from the registry. This patch removes these
unnecessary updates.
2024-12-30 14:56:18 +01:00
Wojciech Mitros
82f2e1b44c schema_registry: update cached base schemas when updating a view
The schema registry now holds base schemas for view schemas.
The base schema may change without changing the view schema, so to
preserve the change in the schema registry, we also update the
base schema in the registry when updating the base info in the
view schema.
2024-12-30 14:56:18 +01:00
Wojciech Mitros
dfe3810f64 schema_registry: cache base schemas for views
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

To store the base schema, the loader methods now have to return the base
schema alongside the view schema. At the same time, when loading into
the registry, we need to check whether we're loading a view schema, and if
so, we need to also provide the base schema. When inserting a regular table
schema, the base schema should be a disengaged optional.
2024-12-30 14:56:17 +01:00
Wojciech Mitros
6f11edbf3f db: set base info before adding schema to registry
In the following patches, we'll assure that view schemas returned by the
schema registry always have base info set. To prepare for that, make sure
that the base info is always set before inserting it into schema registry,
2024-12-30 14:56:17 +01:00
Avi Kivity
b32b7ab806 Merge 'test.py: only access combined_tests executable if it is built' from Konstantin Osipov
test.py: only access combined_tests executable if it is built

Fixes #22038

Closes scylladb/scylladb#22069

* github.com:scylladb/scylladb:
  test.py: only access combined_tests if it exists
  test.py: rethrow CancelledError when executing a test
2024-12-30 15:15:39 +02:00
Piotr Smaron
2352063f20 server: set connection_stage to READY when authenticated
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice only happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.

Fixes: scylladb/scylladb#12640

Closes scylladb/scylladb#21774
2024-12-30 14:04:26 +02:00
Kefu Chai
6281fb825f test/pytest.ini: ignore warning on deprecated record_property fixture
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecated when the generating xunit reports.
and pytest generates following warning when a test failure is
reported using this fixture:

```
  object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```

this warning is not related to the test, but more about how we
report a failure using pytrest. it is distracting, so let's silence it.

See also https://github.com/pytest-dev/pytest/issues/5202

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22067
2024-12-30 10:58:31 +02:00
Nadav Har'El
27180620af Merge 'topology_random_failures: deselect more cases which can cause #21534' from Evgeniy Naydanov
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which caused by `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.

Need to deselect them for now to make CI more stable.  First batch deselected in https://github.com/scylladb/scylladb/pull/21658

Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.

See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957

Closes scylladb/scylladb#22044

* github.com:scylladb/scylladb:
  test.py: topology_random_failures: more deselects for #21534
  test.py: topology_random_failures: handle more node's hangs during 30s sleep
2024-12-30 10:52:22 +02:00
Michael Litvak
2701b5d50d cql3: allow set subscript
This allows to use subscript on a set column, in addition to map/list
which was possible until now.
The behavior is compatible with Cassandra - a subscript with a specific value
returns the value if it's found in the set, and null otherwise.
2024-12-30 09:50:31 +02:00
Konstantin Osipov
8b7a5ca88d test.py: only access combined_tests if it exists
When the scylla source tree is only partially built,
we still may want to run the tests.

test.py builds a case cache at boot, and executes
--list-cases for that, for all built tests.

After amalgamating boost unit tests into a single
file, it started running it unconditionally, which broke
partial builds.

Hence, only use combined_tests executable if it exists.

Fixes #22038
2024-12-27 14:54:13 -05:00
Konstantin Osipov
2b1ba9c3fd test.py: rethrow CancelledError when executing a test
Commit 870f3b00fc,
"Add option to fail after number of failures" adds
tracking on the number of cancelled tests.

For the purpose, it intercepts CancelledError
and sets test's is_cancelled flag.

This introduced a regression reported in gh-21636:
Ctrl-C no longer works, since CancelledError is muted.

There was no intent to mute the exception,
re-throw it after accounting the test as cancelled.
2024-12-27 14:40:47 -05:00
Michał Chojnowski
fdb2d2209c messaging_service: use advanced_rpc_compression::tracker for compression
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.

`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.

`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.

This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
2024-12-27 10:17:58 +01:00
Kefu Chai
cf35562e89 test/pylib: use foo instead of '{}'.format(foo)
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 17:09:56 +08:00
Kefu Chai
71eccf01c7 test/pylib: use "foo not in bar" instead of "not foo in bar"
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 17:09:56 +08:00
Kefu Chai
6adf70ec03 build: cmake: add CMake options for PGO support
- "Scylla_BUILD_INSTRUMENTED" option

  Scylla_BUILD_INSTRUMENTED allows us to instrument the code at
  different level, namely, IR, and CSIR. this option mirrors
  "--pgo" and "--cspgo" options in `configure.py` . please note,
  the instrumentation at the frontend is not supported, as the IR
  based instrumentation is better when it comes to the use case of
  optimization for performance.
  see https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html
  for the rationales.

- "Scylla_PROFDATA_FILE" option

  this option allows us to specify the profile data previous generated
  with the "Scylla_BUILD_INSTRUMENTED" option. this option mirrors
  the `--use-profile` option in `configure.py`, but it does not
  take the empty option as a special case and consider it as a file
  fetched from Git LFS. that will be handled by another option in a
  follow-up change. please note, one cannot use
  -DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=...
  at the same time. clang just does not allow this. but CSPGO is fine.

- "Scylla_PROFDATA_COMPRESSED_FILE" option

  this option allows us to specify the compressed profile data previouly
  generated with the "Scylla_BUILD_INSTRUMENTED" option. along with
  "Scylla_PROFDATA_FILE", this option mirros the functionality of
  `--use-profile` in `configure.py`. the goal is to ensure user always
  gets the result with the specified options. if anything goes wrong,
  we just error out.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
4154789670 build: cmake: add "Scylla_ENABLE_LTO" option
add an option named "Scylla_ENABLE_LTO", which is off by default.
if it is on, build the whole tree with ThinLTO enabled.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
2647369d46 build: set LTO and PGO flags for Seastar in cmake build
This change extends scylla commit 7cb74df to scylla-enterprise-commit
4ece7e1.

we recently started building Seastar as an external project, so
we need to prepare its compilation flags separately. in enterprise
scylla, we prepare the LTO and PGO related cflags in
`prepare_advanced_optimizations()`. this function is called when
preparing the build rules directly from `configure.py`, and despite
we have equivalant settings in CMake, they cannot be applied to Seastar
due to the reason above.

in this change, we set up the the LTO and PGO compilation flags when
generating the buiding system for Seastar when building using CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
ffe8c5dcdb build: collect scylla libraries with scylla_libs variable
with which, we can set the properties of these targets in a single place.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Kefu Chai
610f1b7a0a build: Unify Abseil CXX flags configuration
- Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags
- Enables more flexible configuration of compiler flags for Abseil libraries
- Provides a centralized approach to setting compilation flags

Previously, sanitizer-specific flags were directly applied to Abseil library builds.
This change allows for more extensible compiling flag management across
different build configurations.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-27 16:16:04 +08:00
Michał Chojnowski
131b1d6f81 configure.py: prepare the build for a default PGO profile in version control
This patch adds the following logic to the release build:

pgo/profiles/profile.profdata.xz is the default profile file, compressed.
This file is stored in version control using git LFS.
A ninja rule is added which creates build/profile.profdata by decompressing it.

If no profile file is explicitly specified, ./configure.py checks whether
the compressed default profile file exists and is compressed.
(If it exists, but isn't compressed, the user most likely has
git lfs disabled or not installed. In this case, the file visible in the working
tree will be the LFS placeholder text file describing the LFS metadata.)

If the compressed file exists, build/profile.profdata is chosen as the used
profile file.
If it doesn't exist, a warning is printed and configure.py falls back
to a profileless build.

The default profile file can be explicitly disabled by passing the empty
--use-profile="" to configure.py

A script is added which re-generates the profile.
After the script is run, the re-generated compressed profile can be staged,
committed, pushed and merged to update the default profile.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
a868b44ad8 configure.py: introduce profile-guided optimization
This commit enables profile-guided optimizations (PGO) in the Scylla build.

A full LLVM PGO requires 3 builds:
1. With -fprofile-generate to generate context-free (pre-inlining) profile. This
profile influences inlining, indirect-call promotion and call graph
simplifications.
2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate
context-sensitive (post-inlining) profile. This profile influences post-inline
and codegen optimizations.
3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary
with both profiles.

We do all three in one ninja call by adding release-pgo and release-cs-pgo
"stages" to release. They are a copy of regular release mode, just with the
flags described above added. With the full course, release objects depend on the
profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo
depends on the profile file generated by build/release-pgo/scylla.

The stages are orthogonal and enabled with separate options. It's recommended
to run them both for full performance, but unfortunately each one adds a full
build of scylla to the compile time, so maybe we can drop one of them in the
future if it turns out e.g. that regular PGO doesn't have a big effect.

It's strongly recommended to combine PGO with LTO. The latter enables the entire
class of binary layout optimizations, which for us is probably the most
important part of the entire thing.
2024-12-27 16:16:04 +08:00
Marcin Maliszkiewicz
80989556ac pgo: add alternator workloads training
This patch adds a set of alternator workloads to pgo training
script.

To confirm that added workloads are indeed affecting profile we can compare:

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 1079870885
Maximum internal block count: 2197851358

and

⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata

Instrumentation level: IR  entry_first = 0
Total functions: 105075
Maximum function count: 5240506052
Maximum internal block count: 9112894084

to see that function counters are on similar levels, they are around 5x higher for alternator
but that's because it combines 5 specific sub-workloads.

To confirm that final profile contains alterantor functions we can inspect:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 356
Total functions: 105075
Number of functions with maximum count (< 100000): 97275
Number of functions with maximum count (>= 100000): 7800
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 356 functions which symbol name contains word alternator were identified as 'hot' (with max count grater than 100'000). Running:

⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata
(...)
Instrumentation level: IR  entry_first = 0
Functions shown: 806
Total functions: 105075
Number of functions with maximum count (< 1): 67036
Number of functions with maximum count (>= 1): 38039
Maximum function count: 7248370728
Maximum internal block count: 13722347326

we can see that 806 alternator functions were executed at least once during training.

And finally to confirm that alternator specific PGO brings any speedups we run:

for workload in read scan write write_gsi write_rmw
do
./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null | grep median
done

results BEFORE:

median 258137.51910849303
median absolute deviation: 786.06
median 547.2578202937141
median absolute deviation: 6.33
median 145718.19856685458
median absolute deviation: 5689.79
median 89024.67095807113
median absolute deviation: 1302.56
median 43708.101729598646
median absolute deviation: 294.47

results AFTER:

median 303968.55333940056
median absolute deviation: 1152.19
median 622.4757636209254
median absolute deviation: 8.42
median 198566.0403745328
median absolute deviation: 1689.96
median 91696.44912842038
median absolute deviation: 1891.84
median 51445.356525664996
median absolute deviation: 1780.15

We can see that single node cluster tps increase is typically 13% - 17% with notable exceptions,
improvement for write_gsi is 3% and for write workload whopping 36%.
The increase is on top of CQL PGO.

Write workload is executed more often because it's involved also as data preparation for read and scan.
Some further improvement could be to separate preparation from training as it's done for CQL but it would
be a bit odd if ~3x higher counters for one flow have so big impact.

Additional disclaimers:
 - tests are performing exactly the same workloads as in training so there might be some bias
 - tests are running single node cluster, more realistic setup will likely show lower improvement

Fixes https://github.com/scylladb/scylla-enterprise/issues/4066
2024-12-27 16:16:04 +08:00
Michał Chojnowski
95c8d88b96 pgo: add a repair workload
This workload is added to teach PGO about repair.
Tests are inconclusive about its alignment with existing workloads,
because repair doesn't seem utilize 100% of the reactor.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
1c9ce0a9ee pgo: add a counters workload
This workload is added to teach PGO about counters.
Tests seem to show it's mostly aligned with existing CQL workloads.

The config YAML is based on the default cassandra-stress schema.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
47dc0399cb pgo: add a secondary index workload
This workload is added to teach PGO about secondary indexes.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-test test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e67f4a5c51 pgo: add a LWT workload
This workload is added to teach PGO about LWT codepaths.
Tests seem to show that it's mostly aligned with existing CQL workloads.

The config YAML was copied from one of scylla-cluster-tests test cases.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
e217c124a6 pgo: add a decommission workload
This workload is added to teach PGO about streaming.
Tests show that this workload is mostly orthogonal to CQL workloads
(where "orthogonal" means that training on workload A doesn't improve workload
B much, while training on workload A doesn't improve workload B much),
so adding it to the training is quite important.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
65abecaede pgo: add a clustering workload
In contrast to the basic workload, this workload uses clustering
keys, CK range queries, RF=1, logged batches, and more CQL types.
Tests seem to show that this workload is mostly aligned with the existing basic
workload (where "aligned" means that training on workload A improves workload B
about as much as training on workload B).

The config YAML is based on the example YAML attached to cassandra-stress
sources.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
c1297dbcd2 pgo: add a basic workload
This commit adds the default cassandra-stress workload to the PGO training
suite.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
f73b122de3 pgo: introduce a PGO training script
Profile-guided optimization consists of the following steps:
1. Build the program as usual, but with with special options (instrumentation
or just some supplementary info tables, depending on the exact flavor of PGO
in use).
2. Collect an execution profile from the special binary by running a
training workload on it.
3. Rebuild the program again, using the collected profile.

This commit introduces a script automating step 2: running PGO training workloads
on Scylla. The contents of training workloads will be added in future commits.
The changes in configure.py responsible for steps 1. and 3. will also appear
in future commits.

As input, the script takes a path to the instrumented binary, a path to a
the output file, and a directory with (optionally) prepopulated datasets for use
in training. The output profile file can be then passed to the compiler to
perform a PGO build.

The script current supports two kinds of PGO instrumentation: LLVM instrumentation
(binary instrumented with -fprofile-generate and -fcs-profile-generate passed to
clang during compilation) and BOLT instrumentation (binary instrumented with
`llvm-bolt -instrument`, with logs from this operation saved to
$binary_path.boltlog)

The actual training workloads for generating the profile will be added in later
commits.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
6f01ceae3d configure.py: don't include non-default modes in dist-server-* rules
dist-server-tar only includes default modes. Let dist-server-deb
and dist-server-rpm behave consistently with it.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
dd1a847d61 configure.py: enable LTO in release builds by default 2024-12-27 16:16:04 +08:00
Michał Chojnowski
4b03b91fbd configure.py: introduce link-time optimization
This patch introduces link-time optimization (LTO) to the build.

The performance gains from LTO alone are modest (~7%), but it's vital ingredient
of effective profile-guided optimization, which will be introduced later.

In general, use of LTO is quite simple and transparent to build systems.
It is sufficient to add the -flto flag to compile and link steps, and use a
LTO-aware linker.
At compile time, -ffat-lto-objects will cause the compiler to emit .o
files both LTO-ready LLVM IR for main executable optimization and machine
code for fast test linking. At link time, those pieces of IR will be
compiled together, allowing cross-object optimization of the main
executable and the fast linking of test executables.

Due to it's high compile time cost, the optimization can be toggled with a
configure.py option. As of this patch, it's disabled by default.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
192cb6de4b configure.py: add a default to add_tristate.
It will be used in the next patch.
2024-12-27 16:16:04 +08:00
Michał Chojnowski
1224200d7a configure.py: unify build rules for cxxbridge .cc files and regular .cc files
This is going to prevent some code duplication in following patches.
2024-12-27 16:16:04 +08:00
Benny Halevy
3e22998dc1 sstables: parse(summary): reserve positions vector
We know the number of positions in advance
so reserve the chunked_vector capacity for that.

Note: reservation replaces the existing reset of the
positions member.  This is safe since we parse the summary
only once as sstable::read_summary() returns early
if the summary component is already populated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21767
2024-12-26 13:33:29 +02:00
Yaron Kaikov
bc487c9456 .github: cherry-pick each commit instead of merge commit when available
Until today, when we had a PR with multiple commits we cherry-pick the merge commit only, which created a PR with only one commit (the merge commit) with all relevant changes

This was causing an issue when there was a need to backport part of the commits like in https://github.com/scylladb/scylladb/pull/21990 (reported by @gleb-cloudius)

Changing the logic to cherry-pick each commit

Closes scylladb/scylladb#22027
2024-12-26 13:10:18 +02:00
Kefu Chai
6acc5294a4 treewide: migrate from boost::copy_range to std::ranges::to
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::to`.

in this change, we:

- replace `boost::copy_range` to `std::ranges::to`
- remove unused `#include` of boost headers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21880
2024-12-26 11:46:26 +02:00
Kefu Chai
6c031ad92f test/topology: Percent-encode URL in pytest artifact links
When embedding HTML documents in pytest reports with links to test artifacts,
parameterized test names containing special characters like "[" and "]" can
cause URL encoding issues. These characters, when used verbatim in URLs, can
trigger HTTP 400 errors on web servers.

This commit resolves the issue by percent-encoding the URLs for artifact links,
ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400
Illegal Path Character" errors.

Changes:
- Percent-encode test artifact URLs to handle special characters
- Improve link robustness for parameterized test names

Fixes scylladb/scylla-pkg#4599
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21963
2024-12-26 10:23:52 +02:00
Benny Halevy
a25c3eaa1c utils: phased_barrier: add close() method
When services are stopped we generally want to call
advance_and_await(), but we should also prevent starting
new operations, so close() would do that be closing the
phased_barrier active gate (which implicitly also awaits
past operations similar to advance_and_await()).

Add unit tests for that and use in existing services.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-26 06:54:07 +02:00
Benny Halevy
311c52fbb1 utils: phased_barrier: advance_and_await: allocate new gate only when needed
If there are no opearions in progress, there is no
need to close the current gate and allocate a new one.
The current gate can be reused for the new phase just as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-26 06:53:43 +02:00
Konstantin Osipov
d87e1eb7ef test: merge topology_experimental_raft into topology_custom
This enables tablets in topology_custom, so explicitly
disable them where tests don't support tablets.
In scope of this rename patch a few imports.
Importing dependencies from another test is a bad idea -
please use shared libraries instead.

Fixed #20193

Closes scylladb/scylladb#22014
2024-12-26 00:33:08 +02:00
Yaron Kaikov
0fc7e786dd .github/scripts/auto-backport.py: fix wrong username param
In 2e6755ecca I have added a comment when PR has conflicts so the assignee can get a notification about it. There was a problem with the user mention param (a missing `.login`)

Fixing it

Closes scylladb/scylladb#22036
2024-12-25 20:41:34 +02:00
Avi Kivity
465449e4a1 test: combined_test: relicense
Was inadvertantly released under the AGPL.
2024-12-25 13:53:54 +02:00
Avi Kivity
3ffe93b6ae Merge 'Enhance load-and-stream with "scope"' from Pavel Emelyanov
The main purpose of this change is to enhance the restore from object storage usage.

Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided list of sstables directory from the remote bucket and then feeds the list of sstables to load_and_stream() method. The method, in turn, iterates over this list, reads mutations and for each mutation decides where to send one by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's the plan to stream faster).

As described above, restore is governed by a single node and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with the "scope" filter which limits where mutations can be streamed to. There are four options for that -- all, dc, rack and node. The "all" is how things work currently, "dc" and "rack" filter out target nodes that don't belong to this node's dc/rack respectively. The "node" scope only streams mutations to local node.

With the "node" scope it's possible to make all nodes in the cluster load mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is the test that shows how it can be possible.

Closes scylladb/scylladb#21169

* github.com:scylladb/scylladb:
  test: Add scope-streaming test (for restore from backup)
  api: New "scope" API param to load-and-stream calls
  sstables_loader: Propagate scope from API down
  sstables_loader: Filter tablets based on scope
  streamer: Disable scoped streaming of primary replica only
  sstables_loader: Introduce streaming scope
  sstables_loader: Wrap get_endpoints()
2024-12-25 13:52:51 +02:00
Nadav Har'El
23213e8696 Merge 'Make get_built_indexes REST API endpoint be consistent with system."IndexInfo" table' from Pavel Emelyanov
It turned out that aforementioned APIs use slightly different sources of information about view build progress/status which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (system table test was addressed by #21677).

Closes scylladb/scylladb#21814

* github.com:scylladb/scylladb:
  test: Add tests for MVs and indexes reporting by API endpoint(s)
  api: Use built_views table in get_built_indexes API
2024-12-25 11:47:03 +02:00
Evgeniy Naydanov
5992e8b031 test.py: topology_random_failures: more deselects for #21534
More cases found which can cause the same 'local_is_initialized()' assertion
during the node's bootstrap.
2024-12-25 06:38:13 +00:00
Evgeniy Naydanov
f337ecbafa test.py: topology_random_failures: handle more node's hangs during 30s sleep
The node is hanging and the coordinator just rollback a topology state.  It's different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`
because in these cases the coordinator just not able to start the topology change at all and
a message in the coordinator's log is different.

Error injections handled:
  - `stop_after_updating_cdc_generation`
  - `stop_before_streaming`

And, actually, it can be any cluster event which lasts more than 30s.
2024-12-25 06:38:13 +00:00
Avi Kivity
f9c3ab03a3 Merge 'Sort by proximity: shuffle equal-distance replicas' from Benny Halevy
This series re-implements locator::topology::sort_by_proximity
and adds some randomization to shuffle equal-distance replicas for improving load-balancing
when reading with 1 < consistency level < replication factor.

This change also adds a manual test for benchmarking sort_by_proximity,
as it's not exercised by the single-node perf-simple-query.

The benchmark shows performance improvement of over 20% (from about 71 ns to 56 ns
per call for 3 nodes vectors), mainly due to "calculate distance only once" which
pre-calculates the distance from the reference node for each replica once, rather than
each time to comparator is called by std::sort

* Improvement.  No backport needed

Closes scylladb/scylladb#21958

* github.com:scylladb/scylladb:
  locator/topology: do_sort_by_proximity: shuffle equal-distance replicas
  locator/topology: sort_by_proximity: calculate distance only once
  utils: small_vector: expose internal_capacity()
  storage_proxy: sort_endpoints_by_proximity: lookup my_id only if cannot sort by proximity
  test/perf: add perf_sort_by_proximity benchmark
  locator: refactor sort_by_proximity
2024-12-24 17:37:48 +02:00
Pavel Emelyanov
644d36996d test: Add tests for MVs and indexes reporting by API endpoint(s)
So far there's the /column_family/built_indexes one that reports the
index names similar to how system.IndexInfo does, but it's not tested.
This patch adds tests next to existing system. table ones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-24 16:18:32 +03:00
Pavel Emelyanov
5eb3278d9e api: Use built_views table in get_built_indexes API
Somehow system."IndexInfo" table and column_family/built_indexes REST
API endpoint declare an index "built" at slightly different times:

The former a virtual table which declares an index completely built
when it appears on the system.built_views table.

The latter uses different data -- it takes the list of indexes in
the schema and eliminates indexes which are still listed in the
system.scylla_views_builds_in_progress table.

The mentioned system. tables are updated at different times, so API
notices the change a bit later. It's worth improving the consistency
of these two APIs by making the REST API endpoint piggy-back the
load_built_views() instead of load_view_build_progress(). With that
change the filtering of indexes should be negated.

Fixes #21587

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-24 16:18:00 +03:00
Benny Halevy
d1490bb7bf locator/topology: do_sort_by_proximity: shuffle equal-distance replicas
To improve balancing when reading in 1 < CL < ALL

This implementation has a moderate impact on
the function performance in contrast to full
std::shuffle of the vector before stable_sort:ing it
(especially with large number of nodes to sort).

Before:
test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    25541973    39.225ns     0.114ns    38.966ns    39.339ns       0.000       0.000       588.5       116.6

After:
sort_by_proximity_topology.perf_sort_by_proximity    19689561    50.195ns     0.119ns    50.076ns    51.145ns       0.000       0.000       622.5       150.6

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-24 13:00:17 +02:00
Benny Halevy
0fe8bdd0db locator/topology: sort_by_proximity: calculate distance only once
And use a temporary vector to use the precalculated distances.
A later patch will add some randomization to shuffle nodes
at the same distance from the reference node.

This improves the function performance by 50% for 3 replicas,
from 77.4 ns to 39.2 ns, larger replica sets show greater improvement
(over 4X for 15 nodes):

Before:
test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    12808773    77.368ns     0.062ns    77.300ns    77.873ns       0.000       0.000      1194.2       231.6

After:
sort_by_proximity_topology.perf_sort_by_proximity    25541973    39.225ns     0.114ns    38.966ns    39.339ns       0.000       0.000       588.5       116.6

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-24 12:27:03 +02:00
Benny Halevy
4af522f61e utils: small_vector: expose internal_capacity()
So we can use it for defining other small_vector
deriving their internal capacity from another small_vector
type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-24 12:19:20 +02:00
Benny Halevy
3a3df43799 storage_proxy: sort_endpoints_by_proximity: lookup my_id only if cannot sort by proximity
topology::sort_by_proximity already sorts the local node
address first, if present, so look it up only when
using SimpleSnitch, where sort_by_proximity() is a no-op.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-24 12:19:20 +02:00
Benny Halevy
75da99ce8b test/perf: add perf_sort_by_proximity benchmark
benchmark sort_by_proximity

Baseline results on my desktop for sorting 3 nodes:

single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          1
random seed:              20241224

test                                               iterations      median         mad         min         max      allocs       tasks        inst      cycles
sort_by_proximity_topology.perf_sort_by_proximity    12808773    77.368ns     0.062ns    77.300ns    77.873ns       0.000       0.000      1194.2       231.6

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-24 12:18:24 +02:00
Pavel Emelyanov
a19ad3c655 Merge 'install-dependencies.sh: cleanups to silence shellcheck' from Kefu Chai
this changeset includes two changes to silence the warnings reported by shellcheck. This changeset has no functional impact and serves as a proactive code improvement.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21756

* github.com:scylladb/scylladb:
  install-dependencies.sh: quote array to avoid re-splitting
  install-dependencies.sh: define local variable using "local -A"
2024-12-24 10:27:27 +03:00
Michał Chojnowski
5ce1e4410f message/dictionary_service: introduce dictionary_service
This "service" is a bag for code responsible for dictionary training,
created to unclutter main() from dictionary-specific logic.

It starts the RPC dictionary training loop when the relevant cluster feature is enabled,
pauses and unpauses it appropriately whenever relevant config or leadership
status are updated, and publishes new dictionaries whenever the training fiber produces them.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
6a982ee0dc service: make Raft group 0 aware of system.dicts
Adds glue which causes the contents of system.dicts to be sent
in group 0 snapshots, and causes a callback to be called when
system.dicts is updated locally. The callback is currently empty
and will be hooked up to the RPC compressor tracker in one of the
next commits.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
cc15ca329e db/system_keyspace: add system.dicts
Adds a new system table which will act as the medium for distributing
compression dictionaries over the cluster.

This table will be managed by Raft (group 0). It will be hooked up to it in
follow-up commits.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
0fd1050784 utils: add advanced_rpc_compressor
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.

The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Expose metrics about the effectiveness of the compression.
3. Allow dynamically switching algorithms and dictionaries on a running
   connection, without any extra waits.

The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.

The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.

In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.

Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
5294762ac7 utils: add dict_trainer 2024-12-23 23:37:02 +01:00
Michał Chojnowski
9de52b1c98 utils: introduce reservoir_sampling
We are planning to improve some usages of compression in Scylla
(in which we compress small blocks of data) by pre-training
compression dictionaries on similar data seen so far.

For example, many RPC messages have similar structure
(and likely similar data), so the similarity could be exploited
for better compression. This can be achieved e.g. by training
a dictionary on the RPC traffic, and compressing subsequent
RPC messages against that dictionary.

To work well, the training should be fed a representative sample
of the compressible data. Such a sample can be approached by
taking a random subset (of some given reasonable size) of the data,
with uniform probability.

For our purposes, we need an online algorithm for this -- one
which can select the random k-subset from a stream of arbitrary
size (e.g. all RPC traffic over an hour), while requiring only
the necessary minimum of memory.

This is a known problem, called "reservoir sampling".
This PR introduces `reservoir_sampler`, which implements
an optimal algorithm for reservoir sampling.

Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`,
which uses it to select a random sample of pages from a stream of bytes.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
d301c29af5 utils: introduce alien_worker
Introduces a util which launches a new OS thread and accepts
callables for concurrent execution.

Meant to be created once at startup and used until shutdown,
for running nonpreemptible, 3rd party, non-interactive code.

Note: this new utility is almost identical to wasm::alien_thread_runner.
Maybe we should unify them.
2024-12-23 23:37:02 +01:00
Michał Chojnowski
866326efe4 utils: add stream_compressor
Adds utilities for "advanced" methods of compression with lz4
and zstd -- with streaming (a history buffer persisted across messages)
and/or precomputed dictionaries.

This patch is mostly just glue needed to use the underlying
libraries with discontiguous input and output buffers, and for reusing the
same compressor context objects across messages. It doesn't contain
any innovations of its own.

There is one "design decision" in the patch. The block format of LZ4
doesn't contain the length of the compressed blocks. At decompression
time, that length must be delivered to the decompressor by a channel
separate to the compressed block itself. In `lz4_cstream`, we deal
with that by prepending a variable-length integer containing the
compressed size to each compressed block. This is suboptimal for
single-fragment messages, since the user of lz4_cstream is likely
going to remember the length of the whole message anyway,
which makes the length prepended to the block redundant.
But a loss of 1 byte is probably acceptable for most uses.
2024-12-23 23:28:12 +01:00
Pavel Emelyanov
972ff80fad test: Add scope-streaming test (for restore from backup)
- create
  - a cluster with given topology
  - keyspace with tablets and given rf value
  - table with some data
- backup
  - flush all nodes
  - kick backup API on every node
- re-create keyspace and table
  - drop it first
  - create again with the same parameters and schema, but don't
    populate table with data
- restore
  - collect nodes to contact and corresponding list of TOCs
    according to the preferred "scope"
  - ask selected nodes to restore, limiting its streaming scope
    and providing the specific list of sstables
- check
  - select mutation fragments from all nodes for random keys
  - make sure that the number of non-empty responses equals the
    expected rf value

Specific topologies, RFs and stream scopes used are:

    rf = 1, nodes = 3, racks = 1, dcs = 1, scope = node
    rf = 3, nodes = 5, racks = 1, dcs = 1, scope = node
    rf = 1, nodes = 4, racks = 2, dcs = 1, scope = rack
    rf = 3, nodes = 6, racks = 2, dcs = 1, scope = rack
    rf = 3, nodes = 6, racks = 3, dcs = 1, scope = rack
    rf = 2, nodes = 8, racks = 4, dcs = 2, scope = dc

nodes and racks are evenly distributed in racks and dcs respectively
in the last topo RF effectively becomes 4 (2 in each dc)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:28:05 +03:00
Pavel Emelyanov
a24dc02255 api: New "scope" API param to load-and-stream calls
There are two of those -- the POST /storage_service/keyspace that loads
and streams new sstables from /upload and POST /storage_service/restore
that does the same, but gets sstables from object store.

The new optional parameter allow users to tun the streaming phase
behavior. The test/pylib client part is also updated here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:28:05 +03:00
Pavel Emelyanov
960041d4b4 sstables_loader: Propagate scope from API down
Semi-mechanical change that adds newly introduced "scope" parameter to
all the functions between API methods and the low-level streamer object.
No real functional changes. API methods set it to "all" to keep existing
behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:28:05 +03:00
Pavel Emelyanov
e8201a7897 sstables_loader: Filter tablets based on scope
Loading and streaming tablets has pre-filtering loop that walks the
tablet map sorts sstables into three lists:
- fully contained in one of map ranges
- partially overlapping with the map
- not intersecting with the map

Sstables from the 3rd list is immediately dropped from the process and
for the remaining two core load-and-stream happens.

This filtering deserves more care from the newly introduced scope. When
a tablet replica set doesn't get in the scope, the whole entry can be
disregarded, because load-and-stream will only do its "load" part anyway
and all mutations from it will be ignored.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:28:05 +03:00
Pavel Emelyanov
93aed22cd5 streamer: Disable scoped streaming of primary replica only
There's been some discussions of how primary replica only streaming
schould interact with the scope. There are two options how to consider
this combination:

- find where the primary replica is and handle it if it's within the
  requested sope
- within the requested scope find the primary replica for that subset of
  nodes, then handle it

There's also some itermediate solution: suppoer "primary replica in DC"
and reject all other combinations.

Until decided which way is correct, let's disable this configuration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:19:17 +03:00
Pavel Emelyanov
30aac0d1da sstables_loader: Introduce streaming scope
Currently load-and-stream sends mutations to whatever node is considered
to be a "replica" for it. One exception is the "primary-replica-only"
flag that can be requested by the user.

This patch introduces a "scope" parameter that limits streaming part in
where it can stream the data to with 4 options:

- all -- current way of doing things, stream to wherever needed
- dc -- only stream to nodes that live in the same datacenter
- rack -- only stream to nodes that live in the same rack
- node -- only "stream" to current node

It's not yet configurable and streamer object initializes itself with
"all" mode. Will be changed later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:19:17 +03:00
Pavel Emelyanov
7c1eaa427e sstables_loader: Wrap get_endpoints()
Preparational patch. Next will add more code to get_endpoints() that
will need to work for both if/else branches, this change helps having
less churn later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-23 19:19:17 +03:00
Benny Halevy
68b0b442fd locator: refactor sort_by_proximity
Extract can_sort_by_proximity() out so it can be used
later by storage_proxy, and introduce do_sort_by_proximity
that sorts unconditionally.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-23 16:42:55 +02:00
Kefu Chai
cd2a2bd021 repair: correct misspelling of "corespondent"
replace "corespondent" with "corresponding" in a logging message.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22003
2024-12-23 11:29:58 +02:00
Takuya ASADA
03461d6a54 test: compile unit tests into a single executable
To reduce test executable size and speed up compilation time, compile unit
tests into a single executable.

Here is a file size comparison of the unit test executable:

- Before applying the patch
$ du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
11G	build/release/test/boost/
29G	build/debug/test/boost/

- After applying the patch
du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
5.5G	build/release/test/boost/
19G	build/debug/test/boost/

It reduces executable sizes 5.5GB on release, and 10GB on debug.

Closes #9155

Closes scylladb/scylladb#21443
2024-12-22 19:14:09 +02:00
Piotr Smaron
200f0bb219 alternator: use get_datacenters() in get_network_topology_options()
Currently, `get_network_topology_options()` is using gossip data
and iterates over topology using IPs and not host IDs, which may
result in operating on inconsistent data.
This method's implemenations has been changed to instead use
`get_datacenters()`, which should always return consistent data.

Fixes: scylladb/scylladb#21490

Closes scylladb/scylladb#21940
2024-12-22 18:57:10 +02:00
Avi Kivity
f8ce49ebe9 cql3: implement NOT IN
Where the grammar supports IN, we add NOT IN. This includes the WHERE
clause and LWT IF clause.

Evaluation of NOT IN follows from IN.

In statement_restrictions analysis, they are different, as NOT IN
doesn't enable any clever query plan and must filter.

Some tests are added. An error message was changed ('in' changed to 'IN'),
so some tests are adjusted.

Closes scylladb/scylladb#21992
2024-12-22 15:15:23 +02:00
Kefu Chai
10c79a4d47 test/pylib: do not check for self.cmd when tearing down ScyllaServer
we already check `self.cmd` for null at the very beginning of the
`ScyllaServer.stop()`, and in the `try` block, we don't reset
`self.cmd`, hence there is no need to check it again.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21936
2024-12-20 16:21:40 +02:00
Avi Kivity
eb62593f2c treewide: use angle brackets when including seastar headers
We treat Seastar as a "system" library, and those are included
with angle brackets.

Closes scylladb/scylladb#21959
2024-12-20 16:16:28 +02:00
Kefu Chai
f1a0613a39 mutation: remove unused function
`prefixed()` is a static function in `mutation_partition_v2.cc`.
and this function is not used in this translation unit. so let's
remove it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22006
2024-12-20 16:12:10 +02:00
Yaniv Michael Kaul
dbe4ac7465 LICENSE-ScyllaDB-Source-Available.md: fix markdown
Codespell complained.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#21980
2024-12-20 16:11:39 +02:00
Aleksandra Martyniuk
1c29726477 replica: do not set tablet_task_info if it isn't valid
Currently, in tablet_map_to_mutation, repair's and migration's
tablet_task_info is always set.

Do not set the tablet_task_info if there is no running operation.

Closes scylladb/scylladb#22005
2024-12-20 16:10:53 +02:00
Amnon Heiman
48f7ef1c30 alternator/executor.cc: Add WCU for update_item
This patch adds WCU support for update_item. The way Alternator modifies
values means we don't always have the full item sizes. When there is a
read-before-write, the code in rmw_operation takes care of the object
size.

When updating a value without read-before-write, we will make a rough
estimation of the value's size. This is better than simply taking 1 (as
we do with delete) and is also more Alternator-like.
2024-12-20 14:55:55 +02:00
Kefu Chai
2a9f34bb85 test/pytest.ini: put repair marker declaration back
During the consolidation of per-suite pytest.ini files (commit 8bf62a086f),
the 'repair' marker was inadvertently dropped. This led to pytest warnings
for tests using the @pytest.mark.repair decorator.

This patch restores the marker declaration to eliminate the distracting
PytestUnknownMarkWarning:

```
test/topology_experimental_raft/test_tablets.py:396
  /home/kefu/dev/scylladb/test/topology_experimental_raft/test_tablets.py:396: PytestUnknownMarkWarning: Unknown pytest.mark.repair - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.repair

```

Restoring the marker allows tests to use the 'repair' mark without
generating warnings.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21931
2024-12-20 14:04:50 +02:00
Botond Dénes
42d24b2a8a Merge 'Retire topology::sort_by_proximity and compare_endpoints flavors using gms::inet_address' from Benny Halevy
This series converts the call site using compare_endpoints with gms::inet_address.
With that both flavors of compare_endpoints and sort_by_proximity for inet_address
can be retired as no other uses remain.

Also, add a unit test for topology::sort_by_proximity before further changes
to it are considered.

* Code cleanup, no backport is needed

Closes scylladb/scylladb#21976

* github.com:scylladb/scylladb:
  test: network_topology_strategy_test: add test_topology_sort_by_proximity
  locator/topology: retire sort_by_proximity/compare_endpoints for inet_address
  test: test_topology_compare_endpoints: use host_id:s
2024-12-20 13:34:55 +02:00
Aleksandra Martyniuk
da7301679b test: truncate the table before node ops task checks
Truncate a table before testing node ops tasks to check if the truncate
request won't be considered by node_ops_virtual_task.
2024-12-20 12:26:42 +01:00
Aleksandra Martyniuk
ee4bd287fd node_ops: rename a method that get node ops entries 2024-12-20 12:25:48 +01:00
Aleksandra Martyniuk
a7fc566c7e node_ops: filter topology_requests entries
Currently node_ops_virtual_task shows stats of all system.topology_request
entries. However, the table also contains info about non-node_ops requests,
e.g. truncate.

Filter the entries used by node_ops_virtual_task by their type.
With this change bootstrap of the first node will not be visible.

Update the test accordingly.
2024-12-20 12:20:42 +01:00
Yaron Kaikov
74c5aabd23 build_docker: add option for building container based on Ubuntu Pro
Today our container is based on ubuntu:22.04, we need to build another container based on Ubuntu Pro for FIPS support (currently the latest one is 20.04)

The default docker build process doesn't change, if FIPS is required I have added `--type pro` to build a supported container.

To enable FIPS there is a need to attach an Ubuntu Pro subscription (it will be done as part of https://github.com/scylladb/scylla-pkg/issues/4186)

Closes scylladb/scylladb#21974
2024-12-20 13:09:24 +02:00
Asias He
0141906c4a repair: Enable small table optimization for RBNO rebuild
Similar to 9ace191616 (repair: Enable
small table optimization for RBNO bootstrap and decommission), this
patch enables small table optimization for RBNO rebuild.

This is useful for rebuild ops which is used for building an empty DC.

Fixes: #21951

Closes scylladb/scylladb#21952
2024-12-20 13:03:34 +02:00
Kefu Chai
24283d9dd0 test/topology: rename manager_internal to manager_client
instead of reusing the variable name and overriding the parameter,
use a new name for the return value of `manager_internal()` for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21932
2024-12-20 13:01:45 +02:00
Kefu Chai
6914892a1b repair: do not include unused headers
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21837
2024-12-20 08:55:56 +02:00
Botond Dénes
d4129ddaa6 Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.

Closes scylladb/scylladb#21895

* github.com:scylladb/scylladb:
  sstables_manager: reclaim memory from sstables on unlink
  sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
  sstables: introduce disable_component_memory_reload()
  sstables_manager: log sstable name when reclaiming components
2024-12-19 15:18:16 +02:00
Kefu Chai
16397d8cba message: do not include unused header
In commit bfee93c7, repair verbs were moved to IDL. During this refactoring,
the `gc_clock.hh` header became unused as its references were relocated.
`clang-include-cleaner` helped identify this unnecessary include, which is
now removed to clean up the codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21919
2024-12-19 15:16:34 +02:00
Michał Chojnowski
f6ebd445e4 test_tablets.py: limit concurrency in test_tablet_storage_freeing
Apparently the python driver can't deal with the current concurrency sometimes.
Lower it from 1000 to 100.

Fixes scylladb/scylladb#20489

Closes scylladb/scylladb#20494
2024-12-19 15:14:41 +02:00
Kefu Chai
df36985fc3 raft: do not include unused headers
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21838
2024-12-19 14:57:22 +02:00
Kefu Chai
93be8f3a0c db,sstables: migate boost::range::stable_partition to std library
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::stable_partition`.

in this change, we:

- replace `boost::range::stable_parition()` to
  `std::ranges::stable_parition()`
- since `std::ranges::stable_parition()` returns a subrange instead of
  an iterator, change the names of variables which were previously used
  for holding the return value of `boost::range::stable_partition()`
  accordingly for better readability.
- remove unused `#include` of boost headers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21911
2024-12-19 14:56:07 +02:00
Avi Kivity
a4440392d7 build: update dependencies for features to be ported from enterprise
ldap/slapd/toxiproxy/cyrus-sasl - for ldap authentication and authorization
git-lfs/bolt - for profile-guided optimization
lz4-static - for dictionary based network compression
jwt - for Oauth/GCP connectivity (for key management)
openkmip - for kmip testing
fipscheck - for FIPS validation

Frozen toolchain regenerated, with optimized clang from

  https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-x86_64.tar.gz
2024-12-19 14:26:31 +02:00
Wojciech Mitros
37a25d3af4 mv: avoid stalls when calculating affected clustering ranges
Currently, when finishing db::view::calculate_affected_clustering_ranges
we deoverlap, transform and copy all ranges prepared before. This
is all done within a single continuation and can cause stalls.

We fix this by adding yields after each transform and moving elements
to the final vector one by one instead of copying them all at the end.

After this change, the longest continuation in this code will be
deoverlapping the initial ranges (and one transform). While it has
a relatively high computational complexity (we sort all ranges), it
should execute quickly because we're operating on views there and
we don't need to copy the actual bytes. If we encounter a stall there,
we'll need to implement an asynchronous `deoverlap` method.

Fixes scylladb/scylladb#21843

Closes scylladb/scylladb#21846
2024-12-19 12:50:30 +01:00
Kamil Braun
91cddcc17f Merge 'Do not reset quarantine list in non raft mode' from Gleb Natapov
The series contains small fixes to the gossiper one of which fixes #21930. Others I noticed while debugged the issue.

Fixes: scylladb/scylladb#21930

Closes scylladb/scylladb#21956

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not send echo message to yourself
  gossiper: do not call apply for the node's old state
2024-12-19 11:03:35 +01:00
Pavel Emelyanov
bb094cc099 Merge 'Make restore task abortable' from Calle Wilund
Fixes #20717

Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data.

Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()".

Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader.

There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored.

Closes scylladb/scylladb#21567

* github.com:scylladb/scylladb:
  test_backup: Add restore abort test case
  sstables_loader: Make restore task abortable
  distributed_loader: Add optional abort_source to get_sstables_from_object_store
  s3_storage: Add optional abort_source to params/object
  s3::client: Make "readable_file" abortable
2024-12-19 12:23:33 +03:00
Benny Halevy
67b7015ced test: network_topology_strategy_test: add test_topology_sort_by_proximity
Before further changes are made to sort_by_proximity
add a unit test for it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-19 09:45:02 +02:00
Benny Halevy
1c5b0eca41 locator/topology: retire sort_by_proximity/compare_endpoints for inet_address
Those are not used anymore now that the last call
site for compare_endpoints by inet_address is converted
to use host_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-19 09:44:41 +02:00
Benny Halevy
dcdc60fffd test: test_topology_compare_endpoints: use host_id:s
This is the last call site requiring the compare_endpoints
flavour for inet_address.

Once this test is converted to use host_id:s instead,
compare_endpoints and sort_by_proximity can be simplified
to support only host_id:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-19 09:44:26 +02:00
Kefu Chai
2a31a82ae2 .github: Ensure header generation before include analysis
When running clang-include-cleaner, the tool performs static analysis by
"compiling" specified source files. Previously, non-existent included headers
caused the tool to skip source files, reducing the effectiveness of unused
include detection.

Problem:
- Header files like 'rust/wasmtime_bindings.hh' were not pre-generated
- Compilation errors led to skipping source file analysis

  ```
  /__w/scylladb/scylladb/lang/wasm.hh:15:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
     15 | #include "rust/wasmtime_bindings.hh"
        |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
  Skipping file /__w/scylladb/scylladb/lang/wasm.hh due to compiler errors. clang-include-cleaner expects to work on compilable source code.
  1 error generated.
```

- This significantly reduced clang-include-cleaner's coverage

Solution:
- Build the `wasmtime_bindings` target to generate required header files
- Ensure all necessary headers are created before running static analysis
- Enable full source file checking for unused includes

By generating headers before analysis, we prevent skipping of source files
and improve the comprehensiveness of our include cleaner workflow.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21739
2024-12-19 09:41:46 +02:00
Ferenc Szili
dc375b8cd3 test: enable test_truncate_with_coordinator_crash
This test was added in PR #19789 but was disabled with xfail because of
the bug with way truncate saved the commit log replay positions. More
specifically, the replay positions for shards that had no mutations were
saved to system.truncated with shard_id == 0, regardless for which shard
it was actually saved for (see #21719).
The bug was fixed in #21722, so this change removes the xfail tag from
the test.

Closes scylladb/scylladb#21902
2024-12-18 18:02:52 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Botond Dénes
1a717f3014 service/storage_proxy: data_resolver::resolve(): apply mutations gently
The data resolved has to apply all mutations from all replica to a
single mutation. In the extreme case, when all rows are dead, the
mutations can have around 10K rows in them. This is not a huge amount,
but it is enough to cause moderate stalls of <20ms.
To avoid this, use the gentle variant of apply(), which can yield in the
middle.

Fixes: scylladb/scylladb#21818

Closes scylladb/scylladb#21884
2024-12-18 15:21:19 +01:00
Kefu Chai
e65fc35b5e replica: do not include unused headers
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21836
2024-12-18 13:52:57 +02:00
Avi Kivity
5a849b0a6a Merge "Move more subsystems to use host ids instead of ips" from Gleb
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows to remove
a lot of (but not all) functions that work on ips from effective
replication map.

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/

Refs: scylladb/scylladb#21777
"

* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
  locator: topology: remove no longer use get_all_ips()
  gossiper: change get_unreachable_nodes to host ids
  locator: drop no longer used ip based functions from effective replication map and friends
  test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
  replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
  replica/mutation_dump: use host ids instead of ips
  alternator: move ttl to work with host ids instead of ips
  storage_service: move node_ops code to use host ids instead of host ips
  streaming: move streaming code to use host ids instead of host ips
  repair: move repair code to use host ids instead of host ips
  gossiper: add get_unreachable_host_ids() function
  locator: topology: add more function that return host ids to effective replication map
  locator: add more function that return host ids to effective replication map
2024-12-18 13:48:22 +02:00
Piotr Dulikowski
d067d8caef Merge 'More Python tests for materialized view and Alternator GSI feature' from Nadav Har'El
This patch includes more tests (in Python) that I wrote while implementing the Alternator UpdateTable feature for adding a GSI to an existing table (https://github.com/scylladb/scylladb/issues/11567).

I explain each of these tests in the separate patches below, but basically they fall into two types:
1. Tests which pass with today's materialized views and Alternator GSI/LSI, and serve to ensure that whatever changes I do to the view update implementation, doesn't break corner cases that already worked.
2. Tests for the UpdateTable feature in Alternator which doesn't work today so xfail - and will need to work for #11567. We already had a few tests for this, but here I add more and improve coverage of various corner cases I discovered while implementing the featue.

I already have a working prototype for #11567 which passes all these tests. Many of these tests helped exposed various bugs in earlier versions of my code.

Closes scylladb/scylladb#21927

* github.com:scylladb/scylladb:
  test/cqlpy: a few more functional tests for materialized views
  test/alternator: more tests for UpdateTable create and delete GSI
  test/alternator: make UpdateTable tests wait less
  test/alternator: move UpdateTable tests to a separate file
  test/alternator: add another test for elaborate GSI updates
  test/alternator: test that DescribeTable returns IndexStatus for GSI
  test/alternator: fix wrong test for UpdateTable metrics
  test/alternator: add test for missing attribute in item in LSI
  test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
  test/alternator: add tests for RBAC for create and delete GSI
2024-12-17 20:43:07 +01:00
Yaron Kaikov
3a00ffd2eb build_docker.sh: remove rsyslog installation and conf
It seems that no one is using rsyslog, so there is no point having it
inside our container (see https://github.com/scylladb/scylladb/issues/21923#issuecomment-2545191667)

Refs: https://github.com/scylladb/scylladb/issues/21923

Closes scylladb/scylladb#21953
2024-12-17 17:34:35 +02:00
Gleb Natapov
e318dfb83a gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
2024-12-17 16:57:13 +02:00
Gleb Natapov
3368019982 gossiper: do not send echo message to yourself
When sending by ID we should check that we do not translate our old
address to our ID and sending locally. mark_alive should not be called
with node's old ip anyway.
2024-12-17 16:57:13 +02:00
Gleb Natapov
e80355d3a1 gossiper: do not call apply for the node's old state
If a nodes changed its address an old state may be still in a gossiper,
so ignore it.
2024-12-17 16:57:13 +02:00
Avi Kivity
01cdba9a98 Merge 'cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get read of flakiness in `cache_algorithm_test` by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes #21536

Should be backported to prevent flaky failures in older branches.

Closes scylladb/scylladb#21948

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2024-12-17 14:46:43 +02:00
Lakshmi Narayanan Sreethar
4fe4367242 sstables_manager: reclaim memory from sstables on unlink
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-12-17 18:14:43 +05:30
Lakshmi Narayanan Sreethar
5dffc19f2d sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
When an sstable is unlinked or deactivated, it should be removed from
the component memory tracking metrics and any further reload/reclaim
should be disabled. This patch adds a new method that implements the
above mentioned functionality. This patch also updates the deactivate()
to use the new method. Next patch will use it to disable tracking when
an sstable is unlinked.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-12-17 18:14:43 +05:30
Lakshmi Narayanan Sreethar
b7b4c5c661 sstables: introduce disable_component_memory_reload()
Added a new method to disable reload of previously reclaimed components
from the sstable. This will be used to disable reload of bloom filters
after an sstable has been unlinked or deactivated.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-12-17 18:14:43 +05:30
Lakshmi Narayanan Sreethar
6ad962cb38 sstables_manager: log sstable name when reclaiming components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-12-17 18:14:36 +05:30
Dawid Mędrek
461a6b129c docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD
As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH
that prevented hashing a password when creating a role, effectively leading
to inserting a hash given by the user directly into the database. In #21350,
we noticed that Cassandra had implemented a CQL statement of similar semantics
but different syntax. We decided to rename Scylla's statement to be compatible
with Cassandra. Unfortunately, we didn't notice one more difference between
what we had in Scylla and what was part of Cassandra.

Scylla's statement was originally supposed to only be used when restoring
the schema and the user needn't have to be aware of its existence at all:
the database produced a sequence of CQL statements that the user saved to
a file and when a need to restore the schema arose, they would execute
the contents of the file. That's why that although we documented the feature,
it was only done in the necessary places. Those that weren't related to
the backup & restore procedure were deliberately skipped.

Cassandra, on the other hand, added the statement for a different purpose
(for details, see the relevant issue) and it was supposed to be used by
the user by design. The statement is also documented as such.

Since we want to preserve compatibility with Cassandra, we document
the statement and its semantics in the user documentation, explicitly
implying that it can be used by the user.

Fixes scylladb/scylladb#21691
2024-12-17 13:43:36 +01:00
Dawid Mędrek
e365653560 test/boost: Add test for creating roles with hashed passwords
We add a new test verifying that after creating a role with a hashed
password using one of the supported encryption algorithms: bcrypt,
sha256, sha512, or md5, the user can successfully log in.
2024-12-17 13:42:15 +01:00
Tomasz Grabiec
e732ff7cd8 tablets: load_balancer: Fail when draining with no candidate nodes
If we're draining the last node in a DC, we won't have a chance to
evaluate candidates and notice that constraints cannot be satisfied (N
< RF). Draining will succeed and node will be removed with replicas
still present on that node. This will cause later draining in the same
DC to fail when we will have 2 replicas which need relocaiton for a
given tablet.

The expected behvior is for draining to fail, because we cannot keep
the RF in the DC. This is consistent, for example, with what happens
when removing a node in a 2-node cluster with RF=2.

Fixes #21826
2024-12-17 12:14:18 +01:00
Tomasz Grabiec
8718450172 tablets: load_balancer: Ignore skip_list when draining
When doing normal load balancing, we can ignore DOWN nodes in the node
set and just balance the UP nodes among themselves because it's ok to
equalize load just in that set, it improves the situation.

It's dangerous to do that when draining because that can lead to
overloading of the UP nodes. In the worst case, we can have only one
non-drained node in the UP set, which would receive all the tablets of
the drained node, doubling its load.

It's safer to let the drain fail or stall. This is decided by topology
coordinator, currently we will fail (on barrier) and rollback.
2024-12-17 12:14:18 +01:00
Botond Dénes
73fc135e02 Merge 'test.py: make sure topology/ and topology_custom/ passes with tablets on.' from Konstantin Osipov
Explicitly disable tablets in a few tests that rely on features not yet supported with tablets.

Closes scylladb/scylladb#21070

* github.com:scylladb/scylladb:
  test: disable tablets in test_raft_fix_broken_snapshot
  test: disable tablets in test_raft_recovery_stuck
  test: disable tablets in tet_raft_recovery_majority_lost
  test: don't run test_raft_recovery_basic with tablets
  test: fix test_writes_to_previous_cdc_generations work with tablets
  test: fix topology_custom/test_mv_topology_change.py to work with tablets
  test: correct replication factor in test_multidc.py
  test: update test_view_build_status to work with tablets
  test: fix test_change_rpc_address with tablets.
  test: explicitly disable tablets in test_gropu0_schema_versioning
  test: disable tablets in topology/test_mutation_schema_change.py
  test: disable tablets in topology/test_mv.py
2024-12-17 08:38:10 +02:00
Aleksandra Martyniuk
d0cda8ebef replica: check enabled features in tablet_map_to_mutation
Before adding a value to a new column in tablet_map_to_mutation
check if the column is supported by the whole cluster.

Closes scylladb/scylladb#21941
2024-12-17 07:02:11 +02:00
Michał Chojnowski
6caaead4ac cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
2024-12-16 23:14:30 +01:00
Michał Chojnowski
1fffd976a4 cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
2024-12-16 23:14:13 +01:00
Nadav Har'El
99e7fdef6d test/cqlpy: a few more functional tests for materialized views
This patch adds a few more functional tests for the CQL materialized
view feature in the cqlpy. The new tests pass, but helped me catch bugs (and
understand what are *not* bugs) while refactoring some view update code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
d9af154772 test/alternator: more tests for UpdateTable create and delete GSI
We already have in test_gsi_updatetable.py several functional tests for
the Alternator feature of adding or deleting a GSI on an existing table,
through the UpdateTable operation.
This patch adds many more tests for various corner cases of this feature -
tests developed in parallel with actually implementing that feature.

All test in test_gsi_updatetable.py pass on Amazon DynamoDB but currently
xfail on Alternator, due to the following issues:

 * #11567: Alternator: allow adding a GSI to a pre-existing table
 * #9424: Alternator GSIs should exclude items with empty-string key components

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
5c7b8c8e4d test/alternator: make UpdateTable tests wait less
The UpdateTable tests for creating and deleting a GSI need to wait for
the asynchronous operation of the view's building and deletion, using
two utility functions wait_for_gsi() and wait_for_gsi_gone().

Because I originally wrote these tests for DynamoDB and its extremely
high latency for these operations, these functions waited a whole second
before checking for the end of the wait. This whole-second sleep is
absurd in Alternator where building a small view takes just a fraction of
a second. So let's lower the sleep time from 1 second to 0.1 seconds,
and allow these tests to pass much faster on Alternator (once this
feature is implemented in Alternator, of course - until then all these
tests still fail immediately on an unimplemented operation).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
b1bd5cdf0f test/alternator: move UpdateTable tests to a separate file
The source file test/alternator/test_gsi.py has already grown very
large, so this patch moves all the existing tests related to using
UpdateTable to add or delete a GSIs to a separate file:
test_gsi_updatetable.py.

We just move tests here - no new tests or functional changes to the
tests - but did use the opportunity for some small improvements in
the comments.

In the next patch we'll add more tests to this new file.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 19:36:47 +02:00
Nadav Har'El
cc308bd0cc test/alternator: add another test for elaborate GSI updates
We have a test, test/alternator/test_gsi.py::test_update_gsi_pk which
created a GSI whose *partition key* was a regular column in the base
table, and exercised various elaborate updates requiring adding,
updating and deleting of rows from the materialized view.

In this patch, we add another similar test case, just for a *clustering
key*.

Both these tests are important regression tests - when we later
reimplement GSI we'll want to verify that none of the complex update
scenarios got broken (and indeed, some broken code did break these
tests).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
9094fe1608 test/alternator: test that DescribeTable returns IndexStatus for GSI
This patch adds a test reproducing issue #11471 - where DescribeTable
on a table that as an already built GSI (creating with the table itself)
must return IndexStatus == "ACTIVE".

This test passes on DynamoDB, but xfails on Alternator because of
issue #11471.

We actually had this check earlier, but it was part of a bigger xfailing
tests that checked multiple features. It's better to have it as a
separate test just for this feature, as we'll soon fix this issue and
make this test pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
1b120e3c7e test/alternator: fix wrong test for UpdateTable metrics
The test we had for counting Alternator operations metrics ran the
UpdateTable request without any parameters, which isn't actually a
valid call - Amazon DynamoDB rejects such a call, saying one of the
different parameters must be present, and we'll want to do that
later too.

So let's fix the test to use a valid UpdateTable request, one that
does the silly BillingMode='PAY_PER_REQUEST'. This is already the
current setting, so nothing is really changed, but it's still counted
as an operation in the metric.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
85088516b2 test/alternator: add test for missing attribute in item in LSI
Test that when a table has an LSI, then if the indexed attribute is
missing, the item is added to the base table but not the index.

We already have exactly the same test for GSI in test_gsi.py, but forgot
to do write the same test for LSI. It's important to test this scenario
separately for GSIs and LSIs because in an upcoming GSI reimplementation
we plan to make the GSI and LSI implementation slightly different, and
they can have separate bugs (and in fact, we had such an LSI-specific
bug in one broken implementation).

We also have the same scenario that is tested here in the test
test_streams.py::test_streams_updateitem_old_image_lsi_missing_column
but that was a Alternator Streams test and we should have a more basic
test for this scenario in test_lsi.py.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
b00f5a6070 test/alternator: test that DescribeTable doesn't return IndexStatus for LSI
Whereas GSIs have an IndexStatus when described by DescribeTable,
LSIs do not. The purpose of IndexStatus is to tell when the index is live,
and this is not needed for LSIs because they cannot be added to a base
table that already exists.

We already had a test for this, but it was hidden in an xfailing test
for many different DescribeTable attributes - so let's move it into it's
own, *passing*, test. The new tests passes on both Alternator and
Amazon DynamoDB.

This test is an important regression test for when we later add
IndexStatus support to GSI, and this test will ensure that we don't
accidentally introduce IndexStatus to LSIs as well - DynamoDB doesn't
generate it for LSIs so neither should Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:56:28 +02:00
Nadav Har'El
373b37b5da test/alternator: add tests for RBAC for create and delete GSI
In later patches we will implement (as requested in issue #11567) the
UpdateTable operation for creating a new GSI or removing a GSI on an
existing table. In this patch we add to test/alternator/test_cql_rbac.py
tests to exhaustively check that the new operations will behave as expected
in respect to role-based access control (RBAC):

1. UpdateTable requires the ALTER permissions on the affected table -
   as was already the case before (and was documented in compatibility.md).
   This should also be true for the newly-implemented UpdateTable
   operations that create a GSI and delete a GSI, and we test that.

   The above statement may sound counter-intuitive - why does creating
   or deleting a GSI require ALTER permissions (on the base table), not
   CREATE or DROP permissions? But this makes sense when you consider
   that CREATE permissions should allow you create new independent tables,
   not to change the behavior or performance of existing tables (which
   adding a GSI does).

2. When a role has permissions to create a GSI, it should be able to
   read the new GSI (SELECT permissions). This is known as "auto-grant".

3. When a GSI is deleted, whatever permissions was set on it is revoked,
   so that if it's later recreated, the old permissions don't resurface.
   This is known as "auto-revoke".

Because the UpdateTable feature for creating and deleting a GSI is not
yet enabled, the new tests are all marked "xfail".

The new tests, like all tests in the file test/alternator/test_cql_rbac.py
are Scylla-only and are skipped on Amazon DynamoDB - because they test
the Scylla-only CQL-based role-based access control API.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-16 18:55:28 +02:00
Tomasz Grabiec
2de3c079b2 tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained
Empty plan with nodes to drain meant that we can exit tablet_draining
transition and move to the next stage of decommission/removenode.

In case tablet scheduler creates an empty plan for some reason but
there are still underained tablets, that could put topology in an
invalid state. For example, this can currently happen if there
are no non-draining nodes in a DC.

This patch adds a safety net in the topology coordinator which
prevents moving forward with undrained tablets.
2024-12-16 16:54:59 +01:00
Konstantin Osipov
686c0e517f test: disable tablets in test_raft_fix_broken_snapshot
The test is using force_gossip_topology_changes which
doesn't work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
bba034202d test: disable tablets in test_raft_recovery_stuck
The test is using force_gossip_topology_mode which doesn't
work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
3767a54696 test: disable tablets in tet_raft_recovery_majority_lost
The test is using force_gossip_topology_mode which doesn't
work with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
e961d692e6 test: don't run test_raft_recovery_basic with tablets
It uses force_gossip_topology_changes, which doesn't work
with tablets.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
d6fc0d5512 test: fix test_writes_to_previous_cdc_generations work with tablets
The test is testing CDC. CDC doesn't work with tablets.
Explicitly disable tablets in the keyspaces used by the test.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
169c2e62b8 test: fix topology_custom/test_mv_topology_change.py to work with tablets
test_mv_topology_change runs in gossip mode, so disable tablets as well.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
ff43f8d9f6 test: correct replication factor in test_multidc.py
In tablets mode, it is not allowed to CREATE a table
if replication factor can be satisfied. E.g. if the keyspace
is defined to have replication_factor = 3 and there
are only 2 replicas, in vnodes mode one still can
CREATE the table and write to it, whereas in tablets
mode one gets an error.

The confusion is what 'replication_factor' means.
When NetworkTopologyStrategy is used, in multi-dc mode, each DC must
have at least 'replication_factor' replicas and stores
'replication_factor' copies of data.

The test author (as well as the author of this "fix", see
my confused report of gh-21166) assumed that 'replication_factor'
means the total number of replicas, not the number of replicas
per DC.

Correct the test to use only one replica per DC, as this is the
topology the test is working with. The test is not specific
to the number of replicas, so the change does not impact
the logic of the test.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
1e582b4c0f test: update test_view_build_status to work with tablets
The test runs a bunch of tests in gossip only mode, which doesn't
work with tablets, so disable tablets explicitly in these tests.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
3e55f1c033 test: fix test_change_rpc_address with tablets.
With tablets, it's not allowed to create a table in a keyspace
which replication factor exceeds the actual number of nodes in the
cluster.

Pass the replication factor to random_tables fixture so that
a keyspace with a correct replication_factor is created.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
4b10c10c1b test: explicitly disable tablets in test_gropu0_schema_versioning
This is a gossip-based topology changes test, and tablets
don't work with gossip based topology.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
4aa7dca862 test: disable tablets in topology/test_mutation_schema_change.py
This test uses lightweight transactions, which are not enabled
with tablets keyspaces.
2024-12-16 08:38:05 -05:00
Konstantin Osipov
2866b4f550 test: disable tablets in topology/test_mv.py
The test file contains two test cases, which both test
materialized view tombstone gc settings. With tablets the default
is "repair" which is different from vnodes.

The tests are testing that the gc settings are not inherited. With
tablets, the gc settings are forced. This is indistinguishable from
inheriting, so the tests are failing when run with tablets.
2024-12-16 08:38:05 -05:00
Botond Dénes
e6447f60c2 Merge 'db,auth,locator: Remove unused member variables' from Kefu Chai
this issue was identified by clang-20.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21835

* github.com:scylladb/scylladb:
  locator: remove unused member variable
  auth: remove unused member variable
  db: remove unused member variable
2024-12-16 15:16:17 +02:00
Kefu Chai
f2638c3d18 test: topology_custom: restrcuture comment as ordered list
When investigating issue #21724, the docstring for
`test_recover_stuck_raft_recovery` was found to be difficult to follow.
Restructured the docstring into an ordered list to:

1. Improve readability
2. Clearly outline the test steps
3. Make the test's logic and flow more immediately comprehensible

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21728
2024-12-16 14:30:13 +02:00
Pavel Emelyanov
7db9132b56 test: Add validation of getting/changing compaction strategy via REST API
The /column_family/compaction_strategy has GET and POST implemented, the
latter changes the strategy on the table.

Unknown strategy name implicitly renders internal server error code by
catching exception from compaction_strategy::type() that tries to
convert strategy name string to strategy enum class type.

This is to finish validation of #21533

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21569
2024-12-16 14:28:23 +02:00
Botond Dénes
34a8b492be Merge 'materialized view: make flow-control maximum delay configurable' from Piotr Dulikowski
This pull request is continuation of scylladb/scylladb#20688 - contents of the main commit are the same, the only change is the additional commit with a test.

Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays.

This hard-coded one maximum second delay was considered *huge* - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one second latecy for writes wasn't having any noticable affect on slowing down the client.

So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.

Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.

The new parameter's help string mentions both these use cases of the parameter.

Fixes #18187

This is new functionality, no need to backport to any open source release.

Closes scylladb/scylladb#21647

* github.com:scylladb/scylladb:
  materialized views: test for the MV delay configuration parameter
  service: add injection for skipping view update backlog
  materialized view: make flow-control maximum delay configurable
2024-12-16 14:20:33 +02:00
Yaron Kaikov
2e6755ecca .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)

Let's add a comment to notify the user we have conflicts

Closes scylladb/scylladb#21939
2024-12-16 14:17:40 +02:00
Raphael S. Carvalho
013e0d53ff replica: Fix use-after-free due to a race between split and cleanup
There is an assumption that every destroyed compaction_group will be stopped first.
Otherwise, the group is still referenced by compaction manager and can use it after
freed. That's what happened in issue #21867 in the context of merge.

The issue is pre-existing but was made more likely with merge.

One problem is a race between split and cleanup, where if split is emitted while
cleanup is stopping groups, it can happen split preparation adds new groups that will
never be closed, since cleanup is already past the group stopping step.

Another problem found is that split completion handler is not accounting for possible
existence of merging groups, if split happens right after merge. Split completion
handler should stop all empty groups that previously had data split from them.

The problems will be fixed by guaranteeing that new groups will not be added for a
tablet being migrated away, and that empty groups are properly closed when handling
split completion.

A reproducer was added.

Fixes #21867.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21920
2024-12-16 13:19:26 +02:00
Avi Kivity
fe9fcdfe30 task_manager.hh: replace boost ranges with std ranges
Standardize on one range library to reduce dependency load.

Unfortunately, std::views::concat (the replacement for boost::join),
is C++26 only. We use two separate inserts to the result vector to
compensate, and rationalize it by saying that boost::join() is likely
slow due to the need for type-erasure.

Closes scylladb/scylladb#21834
2024-12-16 13:08:02 +02:00
Artsiom Mishuta
e4dc86b552 fix(test.py): adjust break_manager method
remove unnecessary _mark_dirty call

server_broken_event - stop the whole file execution
(prevent the next tests from running because
Pyhon server object is broken PR: scylladb/scylladb#18236).
and next file execution will create its new cluster
so _mark_dirty will not change anything

Closes scylladb/scylladb#21429
2024-12-16 11:24:03 +01:00
Benny Halevy
8832301fe0 storage_service: replicate_to_all_cores: clear_gently pending erms
In case the update is rolled back on error, call clear_gently
for table_erms and view_erms to prevent potential stalls
with a large number of tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-15 12:17:28 +02:00
Benny Halevy
500ca17370 test_mv_topology_change: drop delay_after_erm_update injection case
After last patch, we deliberately don't yield between
update of base table erm and updating its view,
which was the scenario tested with the `delay_after_erm_update`
error injection point.

Instead, call maybe_yield in between base/views updates
to prevent reactor stalls with many tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-15 12:11:43 +02:00
Benny Halevy
4bfa3060d0 storage_service: replicate_to_all_cores: update base and view tables atomically
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)

To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-15 12:11:41 +02:00
Benny Halevy
10c4cf930c table: make update_effective_replication_map sync again
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.

Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a memebr and
storage_group_manager::stop() consumes this future during table::stop().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-15 11:45:08 +02:00
Gleb Natapov
6890281486 locator: topology: remove no longer use get_all_ips() 2024-12-15 11:31:11 +02:00
Gleb Natapov
c2e3d875ab gossiper: change get_unreachable_nodes to host ids 2024-12-15 11:31:11 +02:00
Gleb Natapov
c39474cc7e locator: drop no longer used ip based functions from effective replication map and friends 2024-12-15 11:31:11 +02:00
Gleb Natapov
c5f1dc6293 test: move network_topology_strategy_test and token_metadata_test to use host id based APIs 2024-12-15 11:31:11 +02:00
Gleb Natapov
ca55d1e658 replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges 2024-12-15 11:31:11 +02:00
Gleb Natapov
77f8abb19a replica/mutation_dump: use host ids instead of ips 2024-12-15 11:31:11 +02:00
Gleb Natapov
38c13975ca alternator: move ttl to work with host ids instead of ips 2024-12-15 11:31:11 +02:00
Gleb Natapov
03c8ffa45c storage_service: move node_ops code to use host ids instead of host ips 2024-12-15 11:31:11 +02:00
Gleb Natapov
41a57ed2e8 streaming: move streaming code to use host ids instead of host ips
The patch is rather large, but it is a straightforward conversion from
one type to another.
2024-12-15 11:31:11 +02:00
Gleb Natapov
e479ba88af repair: move repair code to use host ids instead of host ips
The patch is rather large, but it is a straightforward conversion from
one type to another.
2024-12-15 11:31:11 +02:00
Gleb Natapov
92815684df gossiper: add get_unreachable_host_ids() function
Will be needed later.
2024-12-15 11:31:10 +02:00
Gleb Natapov
1751791b53 locator: topology: add more function that return host ids to effective replication map
Add host id functions variants along with those that ip based. We will
need them to move more code to host ids.
2024-12-15 11:31:10 +02:00
Gleb Natapov
3b8345ee44 locator: add more function that return host ids to effective replication map
Add host id functions variants along with those that ip based. We will
need them to move more code to host ids.
2024-12-15 11:16:45 +02:00
Kefu Chai
5697160238 install-dependencies.sh: quote array to avoid re-splitting
Before this change, we didn't quote the array of the keys of an
associative array. and shellcheck warns like:

```
In install-dependencies.sh line 330:
    for package in ${!pip_packages[@]}
                   ^-----------------^ SC2068 (error): Double quote array expansions to avoid re-splitting elements.

```

While the current keys in the associative array do not contain spaces,
quoting array expansions is a recommended defensive programming practice.
This change:
- Prevents potential future issues with unexpected whitespace
- Silences Shellcheck warning without changing functionality
- Improves code quality and maintainability

Specifically modified the array iteration from:
   `for package in ${!pip_packages[@]}`
to:
   `for package in "${!pip_packages[@]}"`

This change has no functional impact and serves as a proactive code improvement.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-14 21:11:58 +08:00
Kefu Chai
6eda41f305 install-dependencies.sh: define local variable using "local -A"
"declare -A local GO_ARCH" does not define a single variable, instead
it defines two variables named "local" and "GO_ARCH". shellcheck
warns when analyzing this script:

```
In ./install-dependencies.sh line 188:
    declare -A local GO_ARCH=(
               ^---^ SC2316 (error): This applies declare to the variable named local, which is probably not what you want. Use a separate command or the appropriate `declare` optionsinstead.
               ^---^ SC2034 (warning): local appears unused. Verify use (or export if used externally).
```

and per the output of "help declare":

```
declare: declare [-aAfFgiIlnrtux] [name[=value] ...] or declare -p [-aAfFilnrtux] [name ...]
```
we defined two associative arrays instead of one.

In this change, we use the correct Bash syntax `local -A GO_ARCH` to:
- Create a single, locally-scoped associative array
- Eliminate static analysis warnings
- Improve code readability and maintainability

This is a cleanup change with no production impact.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-14 21:11:58 +08:00
Pavel Emelyanov
3081ce24cd nodetool: Implement [gs]etstreamthroughput commands
They exist in the original documentation, but are not yet implemented.
Now it's possible to do it.

It slightly more complex that its compaction counterpart in a sense than
get method reports megabits/s by default and has an option to convert to
MiBs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 14:39:47 +03:00
Pavel Emelyanov
67089fd5a1 nodetool: Implement [gs]etcompationthroughput commands
They exist in the original documentation, but are not yet implemented.
Now it's possible to do it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 14:39:47 +03:00
Pavel Emelyanov
eb29d6f4b0 test: Add validation of how IO-updating endpoints work
There are now four of those and these are all the same in the way they
interpret the value parameter (though it's named differently)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 13:02:44 +03:00
Botond Dénes
5880a1b90b Merge 'tasks: add tablet migration virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
migration, in addition to tablet repair. Both tablet operations
reuse the same virtual_task because their task data is retrieved
similarly. However, it changes nothing from the task manager
API users' perspective. They can list running migrations or check
their statuses all the same as if migration had its own virtual_task.

Users can see running migration tasks - finished tasks are not
presented with the task manager API. However, the result
of the migration (whether it succeeded or failed) would be
presented to users, if they use wait API.

If a migration was reverted, it will appear to users as failed.
We assume that the migration was reverted, when its destination
does not contain a tablet replica.

Fixes: https://github.com/scylladb/scylladb/issues/21365.

No backport, new feature

Closes scylladb/scylladb#21729

* github.com:scylladb/scylladb:
  test: boost: check migration_task_info in tablet_test.cc
  replica: add repair related fields to tablet_map_to_mutation
  test: add tests to check the failed migration virtual tasks
  test: add tests to check the list of migration virtual tasks
  test: add tests to check migration virtual tasks status
  test: topology_tasks: generalize repair task functions
  service: extend tablet_virtual_task::abort
  service: extend tablet_virtual_task::wait
  service: extend tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: tasks: make get_table_id a method of virtual_task_hint
  service: tasks: extend virtual_task_hint
  replica: service: add migration_task_info column to system.tablets
  locator: extend tablet_task_info to cover migration tasks
  locator: rename tablet_task_info methods
2024-12-13 10:54:03 +02:00
Pavel Emelyanov
fa1ad5ecfd api: Implement /storage_service/(stream|compaction)_throughput endpoints
Both values are in fact db::config named values. They are observed by,
respectively, compaction manager and stream manager: when changed, the
observer kicks corresponding sched group's update_io_bandwidth() method.

Despite being referenced by managers, there's no way to update those
values anyhow other than updating config's named values themselves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Pavel Emelyanov
6659ceca4f api: Disqualify const config reference
Some endpoints in config block will need to actually _update_ values on
config (see next patches why), and const reference stands on the way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Pavel Emelyanov
f3775ba957 api: Implement /storage_service/stream_throughput endpoint
The value can be obtained from the stream_manager

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Pavel Emelyanov
b8bd170212 api: Move stream throughput set/get endpoints from storage service block
In order to get stream throughput, the API will need stream_manager.
In order to set stream throughput, the API will need db::config to
update the corresponding named value on it.

Said that, move the endpoints to relevant blocks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Pavel Emelyanov
d2c9c2abe8 api: Move set_compaction_throughput_mb_per_sec to config block
In order to update compaction throughput API would need to update the
db::config value, so the endpoint in question should sit in the block
that has db::config at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Pavel Emelyanov
7d6f8d728b util: Include fmt/ranges.h in config_file.hh
The operator() of named_value() prints the allowed values on error which
can be a vector, so the ranges formatting should be there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-12-13 11:51:52 +03:00
Yaron Kaikov
b4b7617554 github: check if PR is closed instead of merge
In Scylla, we can have either `closed` or `merged` PRs. Based on that we decide when to start the backport process when the label was added after the PR is closed (or merged),

In https://github.com/scylladb/scylladb/pull/21876 even when adding the proper backport label didn't trigger the backport automation. Https://github.com/scylladb/scylladb/pull/21809/ caused this, we should have left the `state=closed` (this includes both closed and merged PR)

Fixing it

Closes scylladb/scylladb#21906
2024-12-13 06:36:03 +02:00
Avi Kivity
0114e4c2ae Update seastar submodule
* seastar 72c7ac575...3133ecdd6 (12):
  > util/backtrace: Optimize formatter to reduce memory allocation overhead
  > scheduler: Report long queue stall
  > log: drop specialization of boost::lexical_cast for log_level
  > stall-detector: Remove unused _stall_detector_reports_per_minute
  > Merge 'when_all: add Sentinel support to when_all_succeed() ' from Kefu Chai
  > scripts/perftune.py: Implement AWS IMDSv2 call
  > net/tls: Add a way to disable certificate validation
  > tests: Improve websocket parser tests
  > scripts/stall-analyser: improve error messages on invalid input
  > reserve-memory: document that seastar just doesnt use the reserves
  > Merge 'Minor metrics memory optimizations' from Stephan Dollberg
  > json_formatter: Add support for standard range containers

Closes scylladb/scylladb#21869
2024-12-12 18:30:54 +02:00
Gleb Natapov
34a4144a17 messaging_service: do not rely on address map to find an IP rpc client is connected to
Store the endpoint ip address together with the client (note it may be
different from the address the client is connected to in case
preferable address is different). This allows up to drop lookup in the
address map which may eventually fail if an endpoint was already
deleted.

Fixes: scylladb/scylladb#21840

Message-ID: <Z1mpMMe-o0ggBU_F@scylladb.com>
2024-12-12 18:10:58 +02:00
Avi Kivity
ecd78c88bf Merge "move more verbs to idl" from Gleb
"
The series moves node ops, repair and streaming verbs to IDL. Also
contains IDL related cleanups.

In addition to the CI tested manually by bootstrapping a node with the
series into a cluster of old nodes with repair and streaming both in
gossiper and raft mode. This exercises repair, streaming and node_ops
paths.

"

* 'gleb/move-more-rpcs-to-idl-v3' of github.com:scylladb/scylla-dev:
  repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such
  streaming: move streaming verbs to IDL
  messaging_service: move repair verbs to IDL
  node_ops: move node_ops_cmd to IDL
  idl: rename partition_checksum.dist.hh to repair.dist.hh
  idl: move node_ops related stuff from the repair related IDL
2024-12-12 17:19:43 +02:00
muthu90tech
e49381119d locator: topology: use node& instead of node*
This change goes thru locator:topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in unordered_set, in
those cases the node is wrapped in std::reference_wrapper.

Fixes scylladb/scylladb#20357

Closes scylladb/scylladb#21863
2024-12-12 13:22:55 +01:00
Aleksandra Martyniuk
8943188442 test: boost: check migration_task_info in tablet_test.cc 2024-12-12 11:40:55 +01:00
Aleksandra Martyniuk
3f9c76c52d replica: add repair related fields to tablet_map_to_mutation 2024-12-12 11:40:40 +01:00
Botond Dénes
05246e123d Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec
Reads which need sstable index were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1-2% of time.

Computing it involves type name parsing.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the least-recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.

Also, fixes a potential use-after-free on schema in column_translation.

Closes scylladb/scylladb#21347

* github.com:scylladb/scylladb:
  sstables: Fix potential use-after-free on column_translation::column_info::name
  sstables: Avoid computing column_values_fixed_lengths on each read
2024-12-12 12:22:32 +02:00
Kefu Chai
714d12014e sstable/mx: use subrange.advance() when appropriate
Replace manual subrange advancement with the more concise and readable
`subrange.advance()` method. This change:

- Eliminates unnecessary subrange instance creation
- Improves code readability
- Reduces potential for unnecessary object allocation
- Leverages the built-in `advance()` method for cleaner iterator handling

The modification simplifies the iteration logic while maintaining the
same functional behavior.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21865
2024-12-12 10:04:12 +02:00
Gleb Natapov
c095f63ea5 repair: repair_flush_hints_batchlog_request::target_nodes is not used any more, so mark it as such
After b3b3e880d3 target_nodes is not used
by the receiver, so we can skip setting it on sender as well.
2024-12-11 18:26:57 +02:00
Gleb Natapov
92c2558a83 streaming: move streaming verbs to IDL 2024-12-11 18:26:50 +02:00
Aleksandra Martyniuk
bc17535427 test: add tests to check the failed migration virtual tasks 2024-12-11 15:17:16 +01:00
Aleksandra Martyniuk
be8dfd220f test: add tests to check the list of migration virtual tasks 2024-12-11 15:17:16 +01:00
Aleksandra Martyniuk
b473efbefd test: add tests to check migration virtual tasks status 2024-12-11 15:17:15 +01:00
Aleksandra Martyniuk
c81dcfc465 test: topology_tasks: generalize repair task functions
Generalize repair task functions so that they can be reused for other
tablet tasks.
2024-12-11 15:17:15 +01:00
Aleksandra Martyniuk
e0d3182fa0 service: extend tablet_virtual_task::abort
Set migration tasks as non abortable.
2024-12-11 15:17:15 +01:00
Aleksandra Martyniuk
4c529a8f2e service: extend tablet_virtual_task::wait
Extend tablet_virtual_task::wait to support migration tasks.

To decide what is a state of a finished migration virtual task
(done or failed), the tablet replicas are checked. The task state
is set to done, if the replicas contain the destination of a tablet
migration.
2024-12-11 15:17:14 +01:00
Aleksandra Martyniuk
de191fb851 service: extend tablet_virtual_task::get_status_helper
Extend tablet_virtual_task::get_status_helper to cover migration
tasks. get_status_helper is used by get_status and wait methods.
Waiting for a task in the latter will be modified in the following
patch.
2024-12-11 15:17:14 +01:00
Aleksandra Martyniuk
50ce3d9106 service: extend tablet_virtual_task::contains
Extend tablet_virtual_task::contains to check migration operations.
Returned virtual_task_hint contains also tablet_id (only for migration
tasks) and task_type.

Return immediately from methods that do not support migration
for non-repair task types. The methods' support for migration
will be implemented in the following patches.
2024-12-11 15:17:14 +01:00
Aleksandra Martyniuk
53bd61a539 service: extend tablet_virtual_task::get_stats
Extend tablet_virtual_task::get_stats to list migration tasks.
2024-12-11 15:17:13 +01:00
Aleksandra Martyniuk
215a15d103 service: tasks: make get_table_id a method of virtual_task_hint 2024-12-11 15:17:08 +01:00
Aleksandra Martyniuk
0caffd67f8 service: tasks: extend virtual_task_hint
Extend virtual_task_hint to contain task_type and tablet_id. These
fields would be used by tablet_virtual_task in the following patches.
2024-12-11 15:15:28 +01:00
Anna Stuchlik
98860905d8 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
2024-12-11 14:00:30 +02:00
Kefu Chai
03599477af dht: include a smaller header file
Replace `dht/sharder.hh` with a "smaller" header, which provides
just the enough dependencies.

in f744007e, we traded `database.hh` with a smaller set of headers.
but it turns out `dht/sharder.hh` can be replaced with a even smaller
one. because `dht::sharder` is defined by `dht/token-sharding.hh`, and
what we need from `dht/sharder.hh` is this class's declaration.
`clang-include-cleaner` identified this issue.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21881
2024-12-11 13:53:01 +02:00
Aleksandra Martyniuk
9fad3a621a replica: service: add migration_task_info column to system.tablets
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
2024-12-11 12:07:36 +01:00
Aleksandra Martyniuk
332347490c locator: extend tablet_task_info to cover migration tasks 2024-12-11 12:07:36 +01:00
Aleksandra Martyniuk
dee6404aa4 locator: rename tablet_task_info methods 2024-12-11 12:07:36 +01:00
Michael Litvak
373855b493 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773
2024-12-11 12:05:28 +01:00
Tomasz Grabiec
440a96605f Merge 'topology_custom/test_tablets: add remove/replace tests for edge cases' from Benny Halevy
Test cases related to #21826:

1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
    remove a node with another node down in the datacenter, leaving
    no normal token owners in that dc (reproducing #21826).
    Removenode is expected to fail in this case since it
    should have no place to rebuild the removed node replicas,
    yet it currently succeeds unexpectedly.

2. test_remove_failure_then_replace: verify that removenode
    fails as expected when there are not enough nodes to
    rebuild its replicas on, with and without additional zero-token nodes.

3. test_replace_with_no_normal_token_owners_in_dc: verify that
    nodes can be replaced in a datacenter that has no live
    token owners, with and without additional zero-token nodes.
    Tablet replace uses all replicas to rebuild the lost replicas
    and therefore should succeed in the edge case.
    The restored data is verified as well.

Refs #21826

* New tests, no backport needed

Closes scylladb/scylladb#21827

* github.com:scylladb/scylladb:
  topology_custom/test_tablets: add remove/replace tests for edge cases
  test: pylib: _cluster_remove_node: log message on successful paths
  test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
2024-12-11 12:04:14 +01:00
Kefu Chai
9f749487cd main.cc: fix typos in comment
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21868
2024-12-11 08:42:41 +02:00
Benny Halevy
b95312064f topology_custom/test_tablets: add remove/replace tests for edge cases
Test cases related to #21826:

1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
remove a node with another node down in the datacenter, leaving
no normal token owners in that dc (reproducing #21826).
Removenode is expected to fail in this case since it
should have no place to rebuild the removed node replicas,
yet it currently succeeds unexpectedly.

2. test_remove_failure_then_replace: verify that removenode
fails as expected when there are not enough nodes to
rebuild its replicas on, with and without additional zero-token nodes.

3. test_replace_with_no_normal_token_owners_in_dc: verify that
nodes can be replaced in a datacenter that has no live
token owners, with and without additional zero-token nodes.
Tablet replace uses all replicas to rebuild the lost replicas
and therefore should succeed in the edge case.
The restored data is verified as well.

Refs #21826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 21:39:15 +02:00
Tomasz Grabiec
8e60a0b831 Merge 'truncate: make TRUNCATE TABLE safe with tablets' from Ferenc Szili
Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.

It works with tablets, but is not safe. A concurrent replication process may bring back old data.

This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.

Backporting is not needed.

Fixes #16411

Closes scylladb/scylladb#19789

* github.com:scylladb/scylladb:
  docs: docs: topology-over-raft: Document truncate_table request
  storage_proxy: fix indentation and remove empty catch/rethrow
  test: add tests for truncate with tablets
  storage_proxy: use new TRUNCATE for tablets
  truncate: make TRUNCATE a global topology operation
  storage_service: move logic of wait_for_topology_request_completion()
  RPC: add truncate_with_tablets RPC with frozen_topology_guard
  feature_service: added cluster feature for system.topology schema change
  system.topology_requests: change schema
  storage_proxy: propagate group0 client and TSM dependency
2024-12-10 17:50:50 +01:00
Kefu Chai
8d63d31e57 service: fix a typo in comment
s/contraints/constraints/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21851
2024-12-10 15:58:49 +02:00
Gleb Natapov
fbfee9666e locator: put real host id into the replication map for everywhere replication strategy
Everywhere replication strategy returns zero host id in replica set instead
of the real one if no tokens are configured yet in token metadata. It
worked because code that translates ids to ips knows that zero host id
is a special one, so putting zero there was equivalent to allow local
access. But now we use host ids directly so we need to return real host
id here to allow local access before token metadata is populated.

Message-ID: <Z1hBHsEo4wYzzgvJ@scylladb.com>
2024-12-10 15:36:00 +02:00
Patryk Jędrzejczak
74dad7d1eb raft: improve logs for abort while waiting for apply
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.

Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.

As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.

We want to backport this change to all supported branches so that
it helps us in all tests.

Closes scylladb/scylladb#21855
2024-12-10 14:23:39 +01:00
Tomasz Grabiec
bf18a17bd6 tablets: scheduler: Fix temporary imbalance in a mixed-capacity cluster on decommission
When tablet scheduler drains nodes, it chooses target location based
on "badness" metric. Nodes with lowest score are preferred. Before the
patch, the score which was used was the number of tablets on that node
post-movement. This way we populate least-loaded node first. But this
works only if nodes have equal number of shards. If nodes have different
capacity, then number of tablets is not a good metric, because we don't
aim to equalize per-node count, but per-shard count. We assume that each
shard has equal capacity.

Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.

Fixes #21783

Closes scylladb/scylladb#21860
2024-12-10 14:18:03 +02:00
Benny Halevy
eeb6d3dd74 test: pylib: _cluster_remove_node: log message on successful paths
Log a message when removenode succeeded as expected
or when it failed as expected with the `expected_error`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 11:55:27 +02:00
Benny Halevy
cd566924b9 test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
Currently, we call server_mark_removed also when removenode
failed with the `expected_error`, where the function returns success
but the server is not supposed to be in a removed state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 11:55:27 +02:00
Botond Dénes
5d040e0206 Merge 'truncate: commit log replay positions are not saved correctly' from Ferenc Szili
TRUNCATE TABLE saves the current commit log replay positions in case there is a crash so that replay knows where to begin replaying the mutations. These are collected and saved per shard into `system.truncated`. In case a shard received no mutations, its replay position will be an empty, default constructed object of type `db::replay_position` with its members set to 0. Truncate will incorrectly interpret these empty replay positions as if they were coming from shard 0, and save them as such, potentially overwriting an actual valid replay position coming from the actual shard 0. In the case of a crash, this will cause the commit log on shard 0 to be replayed from the beginning, and result with data resurrection.

Fixes #21719

Closes scylladb/scylladb#21722

* github.com:scylladb/scylladb:
  test: add test for truncate saving replay positions
  database: correctly save replay position for truncate
2024-12-10 10:05:30 +02:00
Botond Dénes
924189c50e Merge 'replica/table: improve error message when encountering orphaned sstables' from Lakshmi Narayanan Sreethar
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :

```
Storage wasn't found for tablet 1 of table test.test
```

This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.

This patch also adds a testcase to verify the error being thrown.

Fixes https://github.com/scylladb/scylladb/issues/18038

PR improves an error message - no need to backport.

Closes scylladb/scylladb#21805

* github.com:scylladb/scylladb:
  replica/table: fix indent in compaction_group_for_sstable
  replica/table: improve error message when encountering orphaned sstables
2024-12-10 06:34:12 +02:00
Kefu Chai
ce2f80c227 treewide: migrate from boost::make_iterator_range to ranges::subrange
Replace boost::make_iterator_range() with std::ranges::subrange.

This change improves code modernization and reduces external dependencies:

- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>

This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21787
2024-12-09 21:31:53 +02:00
Pavel Emelyanov
6eb6b96456 dirty-memory-manager: Brush up "blocked" state check
One of run_when_memory_available() checks mirrors the one done by the
execution_permitted() helper, so its worth re-using it. Since the former
helper is header template, the latter is worth moving to header too.
And, once re-used, the `bool blocking` variable becomes excessive, and
the `if (blocking)` check can also be expressed with fewer LOCs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21812
2024-12-09 20:44:22 +02:00
Kefu Chai
48c8d24345 treewide: drop support for fmt < v10
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.

in this change, we drop the support fmt < 10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21847
2024-12-09 20:42:38 +02:00
Avi Kivity
1bac6b75dc Merge 'Reserve IOCBs for tool applications' from Botond Dénes
Artifact tests have been failing since the switch to the native nodetool, because ScyllaDB doesn't leave any IOCBs for tools. On some setups it will consume all of them and then nodetool and any other native app will refuse to start because it will fail to allocate IOCBs.
This PR fixes this by making use of the freshly introduced `--reserve-io-control-blocks` seastar option, to reserve IOCBs for tool applications. Since the `linux-aio` and `epoll` reactor backends require quite a bit of these, we enable the `io_uring` reactor backend and switch tools to use this backend instead. The `io_uring` reactor backend needs just 2 IOCBs to function, so the reserve of 10 IOCBs set up in this PR is good for running 5 tool applications in parallel, which should be more than enough.

Fixes: https://github.com/scylladb/scylladb/issues/19185

The problem this PR fixes has a manual workaround (and is rare to begin with), no backport needed.

Closes scylladb/scylladb#21527

* github.com:scylladb/scylladb:
  main: configure a reserve IOCB for scylla-nodetool and friends
  configure: enable the io_uring backend
  main: use configure seastar defaults via app_template::seastar_options
2024-12-09 19:22:19 +02:00
Kefu Chai
a9c244ddf7 dist: scylla_io_setup: use raw string to avoid invalid escape sequence
Use raw string literals to prevent syntax warnings when using regular
expressions with backslash-based patterns.

The original code triggered a SyntaxWarning in developer mode (`python3 -Xdev`)
due to unescaped backslash characters in regex patterns like '\s'. While
CPython typically interprets these silently, strict Python parsing modes
raise warnings about potentially unintended escape sequences.

This change adds the `r` prefix to string literals containing regex patterns,
ensuring consistent behavior across different Python runtime configurations
and eliminating unnecessary syntax warning like:

```
/opt/scylladb/scripts/libexec/scylla_io_setup:41: SyntaxWarning: invalid escape sequence '\s'
  pattern = re.compile(_nocomment + r"CPUSET=\s*\"" + _reopt(_cpuset) + _reopt(_smp) + "\s*\"")
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21839
2024-12-09 19:18:39 +03:00
Emil Maskovsky
969b396699 gossiper: fix the backward incompatible change
In the cleanup commit a840949ea0
a regression was introduced that caused backward incompatible changes
in the gossiper application state name strings.

In the e486e0f759 the value
`application_state::CDC_STREAMS_TIMESTAMP` was changed to
`application_state::CDC_GENERATION_ID`, but the name string
"CDC_STREAMS_TIMESTAMP" was kept for backward compatibility.

The cleanup commit a840949ea0 however
changed the name string to "CDC_GENERATION_ID" by ommission (not noticing
the difference) which caused backward incompatible change.

There is also another case found of "IGNOR_MSB_BITS" (that has a typo -
missing the "E" in "IGNORE") to "IGNORE_MSB_BITS", which also needs to
be reverted back to keep the backward compatibility.

Fixes: scylladb/scylladb#21811

Closes scylladb/scylladb#21813
2024-12-09 16:46:25 +01:00
Ferenc Szili
49cc771bda docs: docs: topology-over-raft: Document truncate_table request 2024-12-09 16:38:50 +01:00
Ferenc Szili
781f0a2397 storage_proxy: fix indentation and remove empty catch/rethrow
This change fixes code indentation in storage_proxy::remote::send_truncate_blocking()
It also removes an empty catch and rethrow block.
2024-12-09 16:38:50 +01:00
Ferenc Szili
e65a235fd5 test: add tests for truncate with tablets
This patch adds the unit tests for truncate with tablets.

test_truncate_while_migration() triggers a tablet migration, then runs
a TRUNCATE TABLE for the table containing the tablet being migrated.
test_truncate_with_concurrent_drop() starts a truncate, then attempts to
drop the table while it is being truncated.
test_truncate_while_node_restart() validates the case where a replica
node is restarted while truncate is running.
test_truncate_with_coordinator_crash() validates if truncate is
correctly completed in cases where the topology coordinator has crashed
or restarted after the truncate session is cleared, but before the
truncate request is finalized.
2024-12-09 16:38:50 +01:00
Ferenc Szili
4cd7a1acab storage_proxy: use new TRUNCATE for tablets
This change adds branching based on keyspace replication method, and
uses the new TRUNCATE for keyspaces with tablets.
2024-12-09 16:38:50 +01:00
Ferenc Szili
93cfeb9160 truncate: make TRUNCATE a global topology operation
This commit adds the code needed to create a TRUNCATE global topology
request. It also adds the handler for this request to the topology
coordinator.
The execution of the truncate operation is not canceled on a timeout,
but the query coordinator side will return a timeout error.
2024-12-09 16:38:37 +01:00
Gleb Natapov
ed7ea1dc71 feature_service: fix typo in address_nodes_by_host_ids feature name
Message-ID: <Z1WYaYuQuPP8lNAX@scylladb.com>
2024-12-09 17:27:27 +02:00
Tomasz Grabiec
2b16428b4f sstables: Fix potential use-after-free on column_translation::column_info::name
column_translation::state is storing pointers to column names, which
are stable only as long as schema_ptr is alive. sstable object caches
last used column_translation, and reuses column_translation::state if
the schema version matches. But this doesn't guarantee that the schema
object was not destroyed and recreated in between. This can happen if
the schema version expired in registry and then was pulled again from a
different node via get_schema_for_read().

Spotted by reading the code.

Fix by storing schema_ptr in column_translation. This can pin old
schema in memory until a newer schema is used to read the sstable, or
until sstable is compacted away. I think this shouldn't be a problem
in practice.
2024-12-09 14:05:37 +01:00
Tomasz Grabiec
b0a5bf8b4a sstables: Avoid computing column_values_fixed_lengths on each read
Reads which need clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1%.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
2024-12-09 14:05:37 +01:00
Gleb Natapov
bfee93c747 messaging_service: move repair verbs to IDL 2024-12-09 14:50:52 +02:00
Gleb Natapov
5f6007f6ec node_ops: move node_ops_cmd to IDL 2024-12-09 14:50:52 +02:00
Gleb Natapov
39c75d3add idl: rename partition_checksum.dist.hh to repair.dist.hh
The file has many more things than partition_checksum. All of them
are repair related now.
2024-12-09 14:49:59 +02:00
Michael Litvak
53224d90be service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748
2024-12-09 13:20:32 +01:00
Kefu Chai
6a18db0aea node_ops: switch from boost::join() to std::ranges::join_view()
Replace boost::join() with std::ranges::join_view() as an interim solution
before C++26's std::views::concat becomes available. This change:

- Reduces dependencies on the Boost Ranges library
- Moves closer to standard library implementations
- Improves code maintainability and future compatibility

This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21786
2024-12-09 13:46:44 +03:00
Botond Dénes
2491a31f4c docs: cql/ddl.rst: document {min,max}_index_interval
Closes scylladb/scylladb#21795
2024-12-09 13:45:20 +03:00
Emil Maskovsky
8191e57036 treewide: fix annotations reported by GH checks
Clean up the unnecessary includes reported by the GitHub checks that are
polluting the PR diffs.

The "utils/assert.hh" report should be actually fixed by the #21739, but
as the usage of `SEASTAR_ASSERT()` is protected by the `SEASTAR_DEBUG`
check it makes sense to include the header conditionally as well.

Closes scylladb/scylladb#21817
2024-12-09 13:44:12 +03:00
Kefu Chai
259ab6dee7 locator: remove unused member variable
this issue was identified by clang-20:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=gnu++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT locator/CMakeFiles/scylla_locator.dir/Debug/ec2_multi_region_snitch.cc.o -MF locator/CMakeFiles/scylla_locator.dir/Debug/ec2_multi_region_snitch.cc.o.d -o locator/CMakeFiles/scylla_locator.dir/Debug/ec2_multi_region_snitch.cc.o -c /home/kefu/dev/scylladb/locator/ec2_multi_region_snitch.cc
In file included from /home/kefu/dev/scylladb/locator/ec2_multi_region_snitch.cc:11:
/home/kefu/dev/scylladb/locator/ec2_multi_region_snitch.hh:31:10: error: private field '_broadcast_rpc_address_specified_by_user' is not used [-Werror,-Wunused-private-field]
   31 |     bool _broadcast_rpc_address_specified_by_user;
      |          ^
1 error generated.
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-09 10:31:09 +08:00
Kefu Chai
c5c5990578 auth: remove unused member variable
this issue was identified by clang-20:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=gnu++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT auth/CMakeFiles/scylla_auth.dir/Debug/default_authorizer.cc.o -MF auth/CMakeFiles/scylla_auth.dir/Debug/default_authorizer.cc.o.d -o auth/CMakeFiles/scylla_auth.dir/Debug/default_authorizer.cc.o -c /home/kefu/dev/scylladb/auth/default_authorizer.cc
In file included from /home/kefu/dev/scylladb/auth/default_authorizer.cc:11:
/home/kefu/dev/scylladb/auth/default_authorizer.hh:29:36: error: private field '_group0_client' is not used [-Werror,-Wunused-private-field]
   29 |     ::service::raft_group0_client& _group0_client;
      |                                    ^
1 error generated.
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-09 10:31:09 +08:00
Kefu Chai
fea0548b44 db: remove unused member variable
this issue was identified by clang-20:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=gnu++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT db/CMakeFiles/db.dir/Debug/hints/resource_manager.cc.o -MF db/CMakeFiles/db.dir/Debug/hints/resource_manager.cc.o.d -o db/CMakeFiles/db.dir/Debug/hints/resource_manager.cc.o -c /home/kefu/dev/scylladb/db/hints/resource_manager.cc
In file included from /home/kefu/dev/scylladb/db/hints/resource_manager.cc:9:
/home/kefu/dev/scylladb/db/hints/resource_manager.hh:130:29: error: private field '_proxy' is not used [-Werror,-Wunused-private-field]
  130 |     service::storage_proxy& _proxy;
      |                             ^
1 error generated.
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-12-09 10:31:09 +08:00
Avi Kivity
f744007e13 dht: auto_refreshing_sharder.hh: don't include database.hh
database.hh is a heavyweight include file with a lot of fan-in.
auto_refreshing_sharder.hh has a lot of fan out. The combination
means a large dependency load.

Deinline the class and use forward declarations to avoid the #include.

There is no expected performance impact because all the functions are
virtual.

Ref #1

Note: this shouldn't belong in dht, but be injected by a higher layer,
but this isn't addressed by the patch.

Closes scylladb/scylladb#21768
2024-12-06 23:11:52 +01:00
Tomasz Grabiec
7e2875d648 Merge 'Add tablet merge support' from Raphael Raph Carvalho
The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now.

Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges.

Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.

Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.

The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so balancer understand when two tablets are being migrated instead of one, to avoid oscillations.

When balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which is about communicating with the topology coordinator that a table is ready for it. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.

Fixes #18181.

system test details:

test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml

instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.

latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode:			 write
  99.9th:	 3.145727ms
  99th:		 1.998847ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 2.031615ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.933311ms
  99.9th:	 3.047423ms
  99th:		 1.933311ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 1.900543ms
  99.9th:	 3.145727ms
  99th:		 1.900543ms
Mode:			 write
  99.9th:	 5.079039ms
  99th:		 3.604479ms
  99.9th:	 35.389439ms
  99th:		 25.624575ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.998847ms
  99.9th:	 3.047423ms
  99th:		 1.998847ms
Mode:			 read
  99.9th:	 3.080191ms
  99th:		 2.031615ms
  99.9th:	 3.112959ms
  99th:		 2.031615ms
```

Closes scylladb/scylladb#20572

* github.com:scylladb/scylladb:
  docs: Document tablet merging
  tests/boost: Add test to verify correctness of balancer decisions during merge
  tests/topology_experimental_raft: Add tablet merge test
  service: Handle exception when retrying split
  service: Co-locate sibling tablets for a table undergoing merge
  gms: Add cluster feature for tablet merge
  service: Make merge of resize plan commutative
  replica: Implement merging of compaction groups on merge completion
  replica: Handle tablet merge completion
  service: Implement tablet map resize for merge
  locator: Introduce merge_tablet_info()
  service: Rename topology::transition_state::tablet_split_finalization
  service: Respect initial_tablet_count if table is in growing mode
  service: Wire migration_tablet_set into the load balancer
  locator: Add tablet_map::sibling_tablets()
  service: Introduce sorted_replicas_for_tablet_load()
  locator/tablets: Extend tablet_replica equality comparator to three-way
  service: Introduce alias to per-table candidate map type
  service: Add replication constraint check variant for migration_tablet_set
  service: Add convergence check variant for migration_tablet_set
  service: Add migration helpers for migration_tablet_set
  service/tablet_allocator: Introduce migration_tablet_set
  service: Introduce migration_plan::add(migrations_vector)
  locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
  locator/tablets: Introduce tablet_map::needs_merge()
  locator/tablets: Introduce resize_decision::initial_decision()
  locator/tablets: Fix return type of three-way comparison operators
  service: Extract update of node load on migrations
  service: Extract converge check for intra-node migration
  service: Extract erase of tablet replicas from candidate list
  scripts/tablet-mon: Allow visualization of tablet id
2024-12-06 18:06:20 +01:00
Lakshmi Narayanan Sreethar
401e7c8f69 replica/table: fix indent in compaction_group_for_sstable
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-12-06 21:22:24 +05:30
Lakshmi Narayanan Sreethar
fa10b0b390 replica/table: improve error message when encountering orphaned sstables
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :

```
Storage wasn't found for tablet 1 of table test.test
```

This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.

This patch also adds a testcase to verify the error being thrown.

Fixes #18038
2024-12-06 21:22:24 +05:30
Kefu Chai
37c49acbac docs/cql/ddl: Clarify crc_check_chance option behavior
Although `crc_check_chance` is accepted as a configuration option in ScyllaDB,
the value is currently ignored during runtime. This change makes this behavior
explicit in the documentation to prevent potential user misunderstandings.

Changes:
- Explicitly document that the option is currently a no-op
- Provide clear guidance on the current implementation
- Prevent confusion about the option's actual functionality

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21794
2024-12-06 13:48:03 +02:00
Abhinav
6c90a25014 Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.

This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.

A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600
2024-12-06 10:45:07 +01:00
Kefu Chai
e04aca7efe github: do not nest ${{}} inside condition
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.

Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
  the step to always execute, regardless of `github.event_name`

Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.

Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements

Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.

Example Error:

```
  python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
  shell: /usr/bin/bash -e {0}
  env:
    DEFAULT_BRANCH: master
    GITHUB_TOKEN: ***
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
    commits = repo.compare(start_commit, end_commit).commits
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```

Fixes scylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21809
2024-12-06 11:11:20 +02:00
Kefu Chai
9f5e2488dd locator,service: correct the misspellings
these misspellings were identified by codespell. in this change,
they are corrected.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21796
2024-12-06 11:10:51 +02:00
Piotr Dulikowski
def51e252d Merge 'service/topology_coordinator: migrate view builder only if all nodes are up' from Michał Jadwiszczak
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.

Fixes scylladb/scylladb#20754

The PR should be backported to 6.2, this version has view builder on group0.

Closes scylladb/scylladb#21708

* github.com:scylladb/scylladb:
  test/topology_custom/test_view_build_status: add reproducer
  service/topology_coordinator: migrate view builder only if all nodes are up
2024-12-06 09:07:07 +01:00
Piotr Dulikowski
c601f7a359 Merge 'transport/server: revert using async function in for_each_gently()' from Michał Jadwiszczak
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.

Fixes scylladb/scylladb#21801

Closes scylladb/scylladb#21761

* github.com:scylladb/scylladb:
  Revert "generic_server: use async function in `for_each_gently()`"
  transport/server: use synchronous calls in `for_each_gently` callback
  service/client_state: add synchronous method to update service level params
  qos/service_level_controller: add `find_cached_effective_service_level`
2024-12-06 08:48:41 +01:00
Emil Maskovsky
2b07d93bea raft: clean up the documentation
Small adjustments and improvements to the documentation in the raft
section.

Fixing Markdown lint warnings:
- MD004/ul-style: Unordered list style [Expected: dash; Actual: asterisk]
- MD007/ul-indent: Unordered list indentation [Expected: 0; Actual: 2]
- MD032/blanks-around-lists: Lists should be surrounded by blank lines
- MD036/no-emphasis-as-heading: Emphasis used instead of a heading
- MD046/code-block-style: Code block style [Expected: fenced; Actual: indented]

Closes scylladb/scylladb#21780
2024-12-05 13:44:11 +01:00
Gleb Natapov
636006f976 topology coordinator: do not for replaced node to appear in the gossiper
There is no point waiting for a node been replaced to appear in the
gossiper since it either will be there already or it will never appear.
gossiper:is_alive() knows how to handle both of those cases, so just
call it directly.
2024-12-05 13:36:52 +01:00
Michał Jadwiszczak
fe67efda5b Revert "generic_server: use async function in for_each_gently()"
This reverts commit 324b3c43c0.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.

Fixes scylladb/scylla#21801
2024-12-05 13:32:47 +01:00
Piotr Dulikowski
dcaf6582c4 materialized views: test for the MV delay configuration parameter
The test does the following:

- Enables an error injection which will cause further view updates to
  get stuck, occupying space in memory and affecting the backlog,
- Performs a single, large write to the base table which causes a single
  view update to be generated; the write is then followed with one more,
  small write to make sure that the other write will be affected by the
  first write's backlog,
- Reads relevant metrics in order to check the exact value of the delay
  that was calculated for the base table write due to MV backpressure.

This is done for different values of the MV delay configuration
parameter (view_flow_control_delay_limit_in_ms) and the calculated
delays are collected into a list. Lastly, the test checks that the
relation between parameter value and the calculated delays is linear.
2024-12-05 11:48:45 +01:00
Avi Kivity
9024e4940c counters.hh: drop unused boost includes
Re-add them to source files that need them.

Closes scylladb/scylladb#21738
2024-12-05 12:27:41 +02:00
Nadav Har'El
86a8ca8a9f Merge 'Alternator add WCU for delelte item' from Amnon Heiman
This series adds WCU support for the delete item operation.
It also splits the Alternator WCU metric by an ops label to give us better visibility of how much each ops contributes to the WCU calculation.

No need to backport to the open source

Closes scylladb/scylladb#21709

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: Add delete Item tests
  alternator/executor: Add WCU support for delete item
  alternator/executer use uint in describe_item
  alternator/consumed_capacity.hh: Make the total_bytes public
  test_metrics validate split wcu_total to ops
  Alternato: split WCU metrics into ops
2024-12-05 11:27:20 +02:00
Piotr Dulikowski
66afdc9b3c service: add injection for skipping view update backlog
Information about view update backlog is propagated in two main ways:

- In RPCs that serve as responses to writes (MUTATION_DONE /
  MUTATION_FAILED)
- Via gossip (application_state::VIEW_BACKLOG)

In tests, it can be benefical to disable the second mechanism. View
update backlog propagation via write responses happens synchronously
with respect to writes so it is easier to control and reason about,
while gossip is asynchronous and can overwrite the backlog that was
propagated via write responses.

Add `skip_updating_local_backlog_via_view_update_backlog_broker` error
injection which skips the logic that updates the local, per-endpoint
cache of view update backlogs from the gossip state.
2024-12-05 09:51:57 +01:00
Nadav Har'El
49f11f655c materialized view: make flow-control maximum delay configurable
Until this patch, the materialized view flow-control algorithm
(https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/)
used a constant delay_limit_us hard-coded to one second, which means
that when the size of view-update backlog reached the maximum (10%
of memory), we delay every request by an additional second - while
smaller amounts of backlog will result in smaller delays.

This hard-coded one maximum second delay was considered *huge* - it will
slow down a client with concurrency 1000 to just 1000 requests per
second - but we already saw some workloads where it was not enough -
such as a test workload running very slow reads at high concurrency
on a slow machine, where a latency of over one second was expected
for each read, so adding a one second latecy for writes wasn't having
any noticable affect on slowing down the client.

So this patch replaces the hard-coded default with a live-updateable
configuration parameter, `view_flow_control_delay_limit_in_ms`, which
defaults to 1000ms as before.

Another useful way in which the new `view_flow_control_delay_limit_in_ms`
can be used is to set it to 0. In that case, the view-update flow
control always adds zero delay, and in effect - does absolutely
nothing. This setting can be used in emergency situations where it
is suspected that the MV flow control is not behaving properly, and
the user wants to disable it.

The new parameter's help string mentions both these use cases of
the parameter.

Fixes #18187

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-05 09:51:56 +01:00
Yaron Kaikov
816a8eafbc ./github/workflow/add-label-when-promoted: fix indentaion which preventing the workflow to be triggered when label was added
this workflow should be triggered either if a push event occurred or
pull_request_target (which mean someone added backport label)

It seems that due to wrong indentation the workflow wasn't trigger
during label add

Fixing it

Closes scylladb/scylladb#21791
2024-12-05 09:47:58 +02:00
Pavel Emelyanov
dd8f56ad3a test: Move test_query_built_indexes_virtual_table from boost to cqlpy
And split it into two -- one for materialized view, another for
secondary index. This is to fit current cqlpy layout that has different
files for views and indexes.

refs: #21552
refs: #21551 (detached this patch from there, as that PR needs fix in
the core code)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21677
2024-12-05 09:17:23 +02:00
Raphael S. Carvalho
d93a0040e5 docs: Document tablet merging
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:11 -03:00
Raphael S. Carvalho
8344722a26 tests/boost: Add test to verify correctness of balancer decisions during merge
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:11 -03:00
Raphael S. Carvalho
76ab293505 tests/topology_experimental_raft: Add tablet merge test
Passed ./test.py --mode=dev ... --repeat=50.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:07 -03:00
Pavel Emelyanov
e1db35c100 Merge 'auth/passwords: Clean up prefix_for_scheme()' from Dawid Mędrek
In this PR, we get rid of the unnecessary default switch case in `prefix_for_scheme()`. We also change the return type of the function to `std::string_view` as it's easier to operate on.

Backport: not needed; this is a code cleanup.

Closes scylladb/scylladb#21749

* github.com:scylladb/scylladb:
  auth/passwords: Change return type of prefix_for_scheme to std::string_view
  auth/passwords.cc: Remove default case in prefix_for_scheme()
2024-12-04 18:38:14 +03:00
Kefu Chai
61ae4a1c86 mutation: remove unused "#include"s
This commit follows up on commit f436edfa22, which initially cleaned up
unused #include directives in the "mutation" subdirectory. This change
removes additional unused header files that were missed in the previous
cleanup.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21740
2024-12-04 15:36:33 +03:00
Kefu Chai
04acf8b075 .github: Add differential-shellcheck workflow for shell script analysis
Introduce a new GitHub workflow to run shellcheck on changed shell
scripts. This workflow automatically detect and highlight potential
shell script issues in pull requests. This change is a follow-up to
commit 0700b322 which fixed an undefined variable issue in `install.sh`.
It intends to leverage static analysis to improve script quality and
catch potential errors early.

Shellcheck will now:
- Analyze all shell scripts modified in pull requests
- Provide inline comments with specific issue details
- Help prevent similar variable-related mistakes in the future

See also
https://github.com/redhat-plumbers-in-action/differential-shellcheck

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21755
2024-12-04 13:34:53 +02:00
Ferenc Szili
fa3ec6e633 storage_service: move logic of wait_for_topology_request_completion()
This change moves to logic of
storage_service::wait_for_topology_request_completion() into
topology_state_machine.
2024-12-04 12:03:15 +01:00
Ferenc Szili
36d35d2297 RPC: add truncate_with_tablets RPC with frozen_topology_guard
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.

Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
2024-12-04 11:30:07 +01:00
Ferenc Szili
bfbfc0fea9 feature_service: added cluster feature for system.topology schema change
This patch adds a feature serive which protects the system.topology
schema change against situations where clusters are incompletely
upgraded to new a version and could be rolled back.
2024-12-04 11:30:07 +01:00
Ferenc Szili
3ac44109e3 system.topology_requests: change schema
This commit adds the new column in the system.topology_requests
table which are needed for the new global topology request.
2024-12-04 11:30:06 +01:00
Ferenc Szili
7f29b7d8f6 storage_proxy: propagate group0 client and TSM dependency
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
2024-12-04 11:30:06 +01:00
Botond Dénes
f55dc71c3f Merge 'Use checksummed input streams in validate_checksums()' from Nikos Dragazis
With commits ed7d352e7d and bb1867c7c7, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code.

This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid.

This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing.

In more detail, this PR consists of:
* Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component).
* Introduction of `error_handler` parameter in checksummed data source and `data_stream()`.
* Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest.
*  Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest.

Refs #19058.

This PR is a hybrid of cleanup and feature. No backport is needed.

Closes scylladb/scylladb#20933

* github.com:scylladb/scylladb:
  tools/scylla-sstable: Rename valid_checksums -> valid
  test: Check validate_checksums() with missing digest
  sstables: Allow validate_checksums() to report missing digests
  sstables: Refactor validate_checksums() to use checksummed data stream
  sstables: Add error_handler parameter to data_stream()
  sstables: Add error handler in checksummed data source
  sstables: Check for excessive chunks in checksummed data source
  sstables: Check for premature EOF in checksummed data source
  test: test_validate_checksums: Check SSTable with invalid digest
  test: test_validate_checksums: Check SSTable with appended data
  test: test_validate_checksums: Complement test for truncated SSTable
2024-12-04 10:46:18 +02:00
Gleb Natapov
b47faed54f idl: move node_ops related stuff from the repair related IDL
Create separate IDL file for node_ops stuff.
2024-12-04 10:36:40 +02:00
Benny Halevy
d5d4307a20 scylla-sstable: dump-summary: print also first and last tokens
To help scylla-manager restore to map sstables to
nodes or tablets, print also the tokens of the
sstable first and last keys.

For example, the json output will now look like this:
```
$ build/dev/scylla sstable dump-summary /tmp/scylla-344593/data/ks/t-52a92590afd011ef9b68ba86378ed63b/me-3glp_0tm9_00uv52doobo0bvk2t7-big-Data.db | jq
{
  "sstables": {
    "/tmp/scylla-344593/data/ks/t-52a92590afd011ef9b68ba86378ed63b/me-3glp_0tm9_00uv52doobo0bvk2t7-big-Data.db": {
      "header": {
        "min_index_interval": 128,
        "size": 1,
        "memory_size": 16,
        "sampling_level": 128,
        "size_at_full_sampling": 0
      },
      "positions": [
        4
      ],
      "entries": [
        {
          "key": {
            "token": "2008715943680221220",
            "raw": "000400000064",
            "value": "100"
          },
          "position": 0
        }
      ],
      "first_key": {
        "token": "2008715943680221220",
        "raw": "000400000064",
        "value": "100"
      },
      "last_key": {
        "token": "9010454139840013625",
        "raw": "000400000003",
        "value": "3"
      }
    }
  }
}
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21735
2024-12-04 10:16:13 +02:00
Botond Dénes
ed08709e75 main: configure a reserve IOCB for scylla-nodetool and friends
Make use of the recently introduced reserve_io_control_blocks to ensure
some reserve IOCBs are left for scylla-nodetool or any other native tool
that might be running intermittently next to ScyllaDB.
These tool apps use the io_uring reactor backend, which requires just 2
IOCBs to function, so the configured default reserve of 10 is good for
running 5 instances of these tools next to ScyllaDB, which should be
good enough.
2024-12-04 02:56:14 -05:00
Botond Dénes
ca956c0180 configure: enable the io_uring backend
To be used by the tool apps -- also change the backend selected in
tools::utils::configure_tool_mode().
We keep using the more mature AIO backend in ScyllaDB itself, so main.cc
sets the linux_aio backend as the default one (the user can still change
this, same as before).
2024-12-04 02:55:31 -05:00
Botond Dénes
f7d66a436e main: use configure seastar defaults via app_template::seastar_options
Instead of the legacy app_template::config. This allows for greater
flexibility, as any option's default can be changed this way, not just
those few that are promoted to app_template::config. This will be made
use of in the next patches.
2024-12-04 02:35:56 -05:00
Raphael S. Carvalho
534ce7340f service: Handle exception when retrying split
It might happen sleep will fail during shutdown, so we should handle
failure for shutdown to proceed gracefully.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 23:55:44 -03:00
Raphael S. Carvalho
3e518c7b23 service: Co-locate sibling tablets for a table undergoing merge
This implements the ability for the balancer to co-locate sibling
tablets on the same shard. Co-location is low in priority, so
regular load balancer is preferred over it. Previous changes
allowed balancer to move co-located sibling tablets together,
to not undo the co-location work done so far.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 23:55:43 -03:00
Raphael S. Carvalho
cd5d1d3c99 gms: Add cluster feature for tablet merge
The reason we need it is that tablet merge can only be finalized
when the cluster agrees on the feature, otherwise unpatched
nodes would fail to handle merge finalization, potentially
crashing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
0a6d41305a service: Make merge of resize plan commutative
set_resize_plan() breaks commutativity since it may override the
resize plans done earlier, for example, when adding co-location
migrations in the DC plan.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
70b3963b8d replica: Implement merging of compaction groups on merge completion
When handling merge completion, compaction groups that belonged to
sibling tablets are placed into the same storage group, since those
tablets become one after merge.

In order to merge two groups, the source group needs its memtable to
be flushed first, such that all the data can be moved into the
destination.

The handling happens in update_effective_replication_map() which cannot
afford to wait for I/O, so the group merge will happen in background.
There's a fiber that will wake up on merge completion and will iterate
through the new set of storage groups (after merge), and will work
on merging additional compaction groups into the main one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
907739f3d1 replica: Handle tablet merge completion
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
48dcefbf45 service: Implement tablet map resize for merge
This implements the ability to resize the tablet map for merge if
the balancer emits the decision to finalize the merge when all
sibling replicas are colocated for a table. But the co-location
plan is not implemented in the balancer yet, so this is still
not in use.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
014e1c9a0f locator: Introduce merge_tablet_info()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:00 -03:00
Raphael S. Carvalho
e00798f1b1 service: Rename topology::transition_state::tablet_split_finalization
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.

The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating topology system table
which last state was split finalization.

NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
75a6fe6a75 service: Respect initial_tablet_count if table is in growing mode
The initial_tablet_count is respected while the table is in "growing mode".
The table implicitly enters this mode when created, since we expect the
table to be populated thereafter.
We say that a table leaves this mode if it required a split above the initial
tablet count. After that, we can rely purely on the average size to say that
a table is shrinking and requires merge.

This is not perfect and we may want to leave the mode too if we detect
the table is shrinking (or even not growing for some significant amount
of time), before any split happened.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
5d3b9dba47 service: Wire migration_tablet_set into the load balancer
If table is undergoing merge, co-located replicas of sibling tablets will
be treated by balancer as if they were a single migration candidate.
The reason for that is that the balancer must not undo the co-location work
done previously on behalf of merge decision.
Sibling tablets will be put in the same migration plan, but note that
each tablet is still migrated independently in the state machine.
The balancer will exclude both co-located tablets from the candidate list
if either haven't finished migration yet. It achieves that by pretending
migration of sibling tablets succeeded, allowing it to note that tablets
are co-located even though either can still be migrating.

The load inversion convergence check also happens after picking a
candidate now, since the balancer must be aware that co-located
tablets are being migrated together and we want to avoid oscillations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
a5cc6fb297 locator: Add tablet_map::sibling_tablets()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
fd33e6dfad service: Introduce sorted_replicas_for_tablet_load()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
fd6bf7b357 locator/tablets: Extend tablet_replica equality comparator to three-way
Will be needed later for sorting tablet replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
ba633b1da2 service: Introduce alias to per-table candidate map type
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
2791923a21 service: Add replication constraint check variant for migration_tablet_set
We have a check that moving a tablet from A to B won't violate
replication constraints.

The contraints might not be the same for two sibling tablets that have co-located
replicas.
Example:
    nodes   = {A, B, C, D}
    tablet1 = {A, B, C}
    tablet2 = {A, B, D}
    viable target for {tablet1, B} is D.
    viable target for {tablet2, B} is C.

When co-located replicas share a viable target, then a migration can be emitted to
preserve co-location.
To allow decommission when co-located replicas don't share a viable target, a skip
info will be returned for each tablet, even though that means breaking this
co-location. Decommission is higher in priority.

Also, doing some preparation for integration of migration_tablet_set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
308741c9cb service: Add convergence check variant for migration_tablet_set
The load inversion convergence check should be able to know when two
tablets are being migrated instead of one, to avoid oscillations.
This will be wired when migration_tablet_set is wired.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
ed06b4b1e7 service: Add migration helpers for migration_tablet_set
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
a5db92b9e6 service/tablet_allocator: Introduce migration_tablet_set
This new type will allow the load balancer to treat co-located tablets
as a single candidate (will treat them as if they were already merged),
allowing co-located replicas to be migrated together (in the same
migration plan).

The type is a variant of global_tablet_id and colocated_tablets
(which holds the global_tablet_id of the sibling tablets).

It will be eventually wired after some more preparation. It will allow
for minimal amount of changes in the balancer code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
96d4f2230e service: Introduce migration_plan::add(migrations_vector)
Allow addition of multiple tablet_migration_info into the plan.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
3082ff992c locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
Adding interface to iterate through sibling tablets for a given table,
one pair at a time.

Initially I thought of having for_each_sibling_tablet do nothing for single
tablet tables. But later I bumped into complications when wiring it into
load balancer for building candidate list, since single-tablet tables
have to be special cased.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
47c8237de0 locator/tablets: Introduce tablet_map::needs_merge()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
93990eb162 locator/tablets: Introduce resize_decision::initial_decision()
Know whether resize (e.g. split) decision was needed above initial tablet
count will be helpful for guiding the merge decision, since we don't
want a merge to happen while table is still growing, but hasn't left
the merge threshold yet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
61f694acf5 locator/tablets: Fix return type of three-way comparison operators
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
4e20a5eeb1 service: Extract update of node load on migrations
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
e2edcf2c88 service: Extract converge check for intra-node migration
This extraction will make it easier later when co-located tablets
are introduced in load balancer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
4a0c3ca576 service: Extract erase of tablet replicas from candidate list
Intra and inter migration can reuse it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
3f9e317b23 scripts/tablet-mon: Allow visualization of tablet id
That will help visualizing co-location of sibling tablets for a table
that is undergoing merge.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Avi Kivity
841481c202 Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).

The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.

Fixes: scylladb/scylladb#6403

perf-simple-query -- smp 1 -m 1G output

Before:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41291 insns/op,   24485 cycles/op,        0 errors)
62669.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41277 insns/op,   24695 cycles/op,        0 errors)
69172.12 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41326 insns/op,   24463 cycles/op,        0 errors)
56706.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41143 insns/op,   24513 cycles/op,        0 errors)
56416.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41186 insns/op,   24851 cycles/op,        0 errors)

         throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
  cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70

After:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   40733 insns/op,   23145 cycles/op,        0 errors)
59283.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40624 insns/op,   23948 cycles/op,        0 errors)
70851.03 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40625 insns/op,   23027 cycles/op,        0 errors)
70549.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40650 insns/op,   23266 cycles/op,        0 errors)
68634.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40622 insns/op,   22935 cycles/op,        0 errors)

         throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
  cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/

Tested mixed cluster manually.
"

* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
  group0: drop unused field from replace_info struct
  test: rename raft_address_map_test to address_map_test and move if from raft tests
  raft_address_map: remove raft address map
  topology coordinator: do not modify expire state for left/new nodes any more in raft address map
  topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
  group0: drop raft address map dependency from raft_rpc
  group0: move raft_ticker_type definition from raft_address_map.hh
  storage_service: do not update raft address map on gossiper events
  group0: drop raft address map dependency from raft_server_with_timeouts
  group0: move group0 upgrade code to host ids
  repair: drop raft address map dependency
  group0: remove unused raft address map getter from raft_group0
  group0: drop raft address map from group0_state_machine dependency since it is not used there any more
  group0: remove dependency on raft address map from group0_state_id_handler
  gossiper: add get_application_state_ptr that searches by host_id
  gossiper: change get_live_token_owners to return host ids
  view: move view building to host id
  hints: use host id to send hints
  storage_proxy: remove id_vector_to_addr since it is no longer used
  db: consistency_level: change is_sufficient_live_nodes to work on host ids
  ...
2024-12-03 18:18:48 +02:00
Amnon Heiman
d2ca1ebfa0 test_returnconsumedcapacity.py: Add delete Item tests
This patch adds three basic tests for delete item. A simple one that
validate that a simple short delete item returns 1 WCU.
The second tries to delete a missing item.
The third stores a bigger item and use the ReturnValues='ALL_OLD' to
make the API gets the previous stored item and see that the WCU is as
expected.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-12-03 15:55:41 +02:00
Amnon Heiman
c62cd08fbe alternator/executor: Add WCU support for delete item
Calculating the item length of WCU deleted Item depends on how the
operations was performed.

In a simple scenario it would be consider a 1 byte.
With an unsafe Read-Before-Write the item is return by get_perious_item
and with LWT the item is get from the apply method.

This patch changes the calls to describe_single_item in the last two
scenarios so that they would use the read item to determine the item
length.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-12-03 15:55:41 +02:00
Amnon Heiman
b888ed84f7 alternator/executer use uint in describe_item
Actions in rmw_operation can use describe_item to determine to get an
existing value (Read before Write scenario) on those cases the existing
item size can be bigger than the one we are storing (in the extreme
case, when deleting an object we only have its keys)

This modify the describe_item API so it would take a pointer to uint
instead of the consumed_capacity_counter so we can use it to get the old
value size and depends on that, determine the size that will be used for
the WCU calculation.
2024-12-03 15:55:41 +02:00
Amnon Heiman
3c6594b26a alternator/consumed_capacity.hh: Make the total_bytes public
rmw operations needs to be able to modify consume_capacity total_bytes
directly.
Depends on the previous stored item the length on which the WCU will be
calculated can be different than the length of the operation.
This patch makes the total_bytes public so it will be possible to modify
it directly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-12-03 15:55:41 +02:00
Amnon Heiman
f4c79d7728 test_metrics validate split wcu_total to ops
This patch modify the post_item WCU test to validate that it uses the
right ops. Note that the test will pass even before this change but we
want to validate the extra label.
2024-12-03 15:55:41 +02:00
Amnon Heiman
8f3dd877ff Alternato: split WCU metrics into ops
This patch add visibility to the WCU metrics. It uses a label 'ops' to
split each of the operations that contribute to WCU into their
operations.

When summing over all ops value the result will be the same.
2024-12-03 15:55:41 +02:00
Avi Kivity
b99d4ec055 abstract_replication_strategy.hh: apply pimpl to boost::icl::interval_map
interval_map is a heavyweight header, hide it behind the pimpl idiom
to reduce #include load.

Ref #1
2024-12-03 13:59:45 +01:00
Botond Dénes
b6a9c79af3 utils/big_decimal: add fast paths to operator <=>
Currently, the tri-compare operator for big_decimal (operator <=>), uses
a precise but potentially very expensive algorithm for comparing the
numbers: it first brings them to the same scale, then compares the
normalized unscaled values. big_decimal has abritrary precisions,
therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in huge amount of
memory allocated and stalls. If this type is used int he primary key of
a table, these comparisons can make the node completely unresponsive.

This patch adds the following fast-paths to operator <=>:
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where the there is a big
  difference between the two numbers. This algorithm works only with the
  scales and is able to compare the two numbers by using only one division
  and some additions and substractions. This algorithm is imprecise and
  when the numbers are closer than its confidence window, it will
  fall-back to the current slow but precise tri-compare.

All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers, which would
previously make the node unresponsive, now compare in constant-time.

Fixes: scylladb/scylladb#21716

Closes scylladb/scylladb#21715
2024-12-03 14:56:51 +02:00
Kamil Braun
8f858325b6 Merge 'topology_coordinator: introduce reload_count in topology state and use it to prevent race' from Gleb Natapov
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.

Closes scylladb/scylladb#21713

* github.com:scylladb/scylladb:
  topology_coordinator: introduce reload_count in topology state and use it to prevent race
  storage_service: use conditional_variable::when in co-routines consistently
2024-12-03 12:00:56 +01:00
Michał Jadwiszczak
38a697d064 transport/server: use synchronous calls in for_each_gently callback
Although the callbacks still return `future<>`, prepare to revert
324b3c43c0 by doing only synchronous calls
in the callbacks.
2024-12-03 11:05:29 +01:00
Michał Jadwiszczak
087bbdc4c8 service/client_state: add synchronous method to update service level params
Similarly to `maybe_update_per_service_level_params`, the method update
connection's params but it gets `service_level_options` as an argument
instead of asking `service_level_controller`.
2024-12-03 10:50:02 +01:00
Michał Jadwiszczak
0a17eca5a1 qos/service_level_controller: add find_cached_effective_service_level
The method is a synchronous equivalent of
`find_effective_service_level`.
It uses recently introduced effective service level cache, so retrieve
user's effective service level is done by quick lookup to the cache.
2024-12-03 10:46:39 +01:00
Michał Jadwiszczak
dab3256dc1 test/topology_custom/test_view_build_status: add reproducer
The test reproduces scylladb/scylladb#20754
2024-12-03 10:17:26 +01:00
Kefu Chai
4bc7e068ff locator: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21754
2024-12-03 11:05:35 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Kefu Chai
99de3962c3 db/schema_applier: Fix spelling annotations to pass codespell checks
This commit addresses inconsistent spelling annotations that triggered
codespell warnings in our codebase.

Problem:
- Previous annotations like "CREATEing" and "DROPing" were flagged as
  misspellings by the codespell workflow
- These annotations were used to describe CQL statement execution contexts

Solution:
- Updated annotations to "CREAT'ing" and "DROP'ing"
- Preserves the intent of the original annotations
- Silences codespell warnings without changing the underlying meaning
- Ensures consistent and spell-checker-friendly code documentation

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21741
2024-12-03 09:01:26 +02:00
Botond Dénes
b87fb94a5e Merge 'tasks: add tablet repair virtual task' from Aleksandra Martyniuk
Add tablet task manager module and keep it in storage_service.
Introduce tablet_virtual_task that covers tablet repair.

Thanks to a repair virtual task, a user can check the list of pending
repairs, get the status of a specific repair, or abort it using the task
manager API.

Fixes: #21368.

No backport, new feature

Closes scylladb/scylladb#21624

* github.com:scylladb/scylladb:
  test: add test to check tablet repair tasks
  test: topology_tasks: enable tablets
  service: keep tablets module in storage_service
  service: rename storage_service::_task_manager_module
  service: add tablet_virtual_task
  tasks: utilize preliminary virtual task lookup
2024-12-02 17:22:44 +02:00
Nadav Har'El
c45ddb964f pytest: don't override default live-logging setting
In commit 8bf62a0 we introduced a test/pytest.ini which affects every
run of pytest in the project. One specific line in that file

    log_cli = true

Overrides pytest's standard CLI output, which is traditionally short
unless the "-v" (verbose) option is used, to be always long and spammy.
There is absolutely no reason to do that - if the user wants to run
"pytest -v", they can do that - it doesn't need to be the default.

Moreover, as https://docs.pytest.org/en/stable/how-to/logging.html
explains, the "log_cli = true" was added in pytest 3.4 to revert to
pytest 3.3 behavior that "community feedback" showed was NOT LIKED.
Why would we want to revert to behavior that wasn't liked?

After this patch, which removes that line, the output of commands
like
    cd test/cqlpy; pytest

return to what they used to be before commit 8bf62a0 and what the
pytest developers intended. Users who like verbose output can use
"pytest -v".

Fixes #21712

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21717
2024-12-02 17:00:51 +02:00
Takuya ASADA
0700b322b8 install.sh: fix incorrect variable name
$without_systemd_check is incorrect variable name, it should be
$skip_systemd_check.

The bug skips to run "systemctl --user daemon-reload" unexpectedly on
nonroot mode installation.

This is likely root cause of the issue #21720.

Fixes #21720

Closes scylladb/scylladb#21747
2024-12-02 16:37:33 +02:00
Dawid Mędrek
1d5502706b auth/passwords: Change return type of prefix_for_scheme to std::string_view 2024-12-02 14:53:38 +01:00
Dawid Mędrek
329a888438 auth/passwords.cc: Remove default case in prefix_for_scheme()
We get rid of the default switch case in the function because it's not
necessary. It's better to get a warning from the compiler if the switch
is nonexhaustive and possibly prevent a bug (operating on a null pointer
may often lead to undefined behavior).
2024-12-02 14:49:44 +01:00
Calle Wilund
91d77987be test_backup: Add restore abort test case
Not a very good test, since the end result cannot be very well verified,
but at least does some checking.
2024-12-02 12:37:58 +00:00
Calle Wilund
cbe255e736 sstables_loader: Make restore task abortable
Fixes #20717

Enables abortable interface and propagates abort_source to
all s3 objects used for reading the restore data.

Note: because restore is done on each shard, we have to maintain
a per-shard abort source proxy for each, and do a background
per-shard abort on abort call. This is synced at the end of "run()"

v2:
* Simplify abortability by using a function-local gate instead.
2024-12-02 12:36:44 +00:00
Calle Wilund
6a2a18a2fc distributed_loader: Add optional abort_source to get_sstables_from_object_store 2024-12-02 12:30:24 +00:00
Calle Wilund
f30864b571 s3_storage: Add optional abort_source to params/object
Adds an abort_source to s3 storage params and resulting
storage interface. Propagates said source to s3 objects created.
2024-12-02 12:30:24 +00:00
Calle Wilund
af4dd1f2cb s3::client: Make "readable_file" abortable
Adds optional abortable source to "readable_file" interface.
Note: the abortable aspect is not preserved across a "dup()" call
however, since these objects are generally not used in a cross-shard
fashion, it should be ok.
2024-12-02 12:30:24 +00:00
Avi Kivity
58baeac0ad Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list
2024-12-02 13:32:49 +02:00
Nadav Har'El
6d37b53653 test/alternator: move comment next to bizarre code that it explains
In commit 9ff9cd37c3 we added in
test/alternator/test_number.py a workaround for a boto3 bug that
prevented us (and still prevents us) from testing numbers with high
precision. Because the workaround was so bizarre, the three lines it
requires - two imports and an assignment - were preceded by a 5-line
comment explaining it.

Unfortunately, a later commit 93b9b85c12
went and arbitrarily moved import lines around to satisfy some PEP-8
"requirements", resulting in the comment being separated from the lines
it was supposed to explain.

This patch moves the comment in front of the main line it explains.
The two imports that are needed just for this line and aren't used
elsewhere remain in their current place (where the PEP8 police demands
they stay), but this is less important for the understanding of this
trick so it's fine.

No functionality of the test was changed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21635
2024-12-02 10:56:09 +01:00
Abhinav
acd643bd75 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.

This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609
2024-12-02 10:32:46 +01:00
Gleb Natapov
052e893444 group0: drop unused field from replace_info struct
The field is no longer used.
2024-12-02 10:31:14 +02:00
Gleb Natapov
1028ce17cd test: rename raft_address_map_test to address_map_test and move if from raft tests
It has nothing to do with raft now.
2024-12-02 10:31:14 +02:00
Gleb Natapov
96309224ff raft_address_map: remove raft address map
It is no longer used.
2024-12-02 10:31:14 +02:00
Gleb Natapov
b9d454c0d5 topology coordinator: do not modify expire state for left/new nodes any more in raft address map
The map is no longer used and gossiper address map is fully managed by
the gossiper.
2024-12-02 10:31:13 +02:00
Gleb Natapov
cbb6148a36 topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used 2024-12-02 10:31:13 +02:00
Gleb Natapov
fca1f90cc7 group0: drop raft address map dependency from raft_rpc
No need to update raft address map on config changes any longer.
2024-12-02 10:31:13 +02:00
Gleb Natapov
64b135db7d group0: move raft_ticker_type definition from raft_address_map.hh
It has nothing to do with raft address map after all.
2024-12-02 10:31:13 +02:00
Gleb Natapov
c65f64cc5f storage_service: do not update raft address map on gossiper events
Raft address map is not use any longer to resolve addresses anyway, so
drop dependency on it from raft_ip_address_updater and rename it to
reflect that it is no longer raft address map specific.
2024-12-02 10:31:13 +02:00
Gleb Natapov
fa1397af13 group0: drop raft address map dependency from raft_server_with_timeouts
It is only needed to translate id to ip in the log output, but there
is no point in doing so now. All the logging (in the converted code)
is id based now.
2024-12-02 10:31:13 +02:00
Gleb Natapov
fbaf0a3cce group0: move group0 upgrade code to host ids
Drop unneeded ip to id translation.
2024-12-02 10:31:13 +02:00
Gleb Natapov
4ddb925997 repair: drop raft address map dependency
Replace it with gossiper address map, but make dependency localized.
Only functions that actually use address map get it now.
2024-12-02 10:31:13 +02:00
Gleb Natapov
ef09a93843 group0: remove unused raft address map getter from raft_group0 2024-12-02 10:31:13 +02:00
Gleb Natapov
85233830cf group0: drop raft address map from group0_state_machine dependency since it is not used there any more 2024-12-02 10:31:13 +02:00
Gleb Natapov
8fbb28cfcb group0: remove dependency on raft address map from group0_state_id_handler
Now that we can look up gossip state by host id we do not need to do the
translation in group0_state_id_handler.
2024-12-02 10:31:13 +02:00
Gleb Natapov
18a9de51e7 gossiper: add get_application_state_ptr that searches by host_id 2024-12-02 10:31:13 +02:00
Gleb Natapov
7d751709e3 gossiper: change get_live_token_owners to return host ids
Also amend the only user and drop the ip to id translation.
2024-12-02 10:31:13 +02:00
Gleb Natapov
20d1b80535 view: move view building to host id
Use host ids in view building code as well.
2024-12-02 10:31:13 +02:00
Gleb Natapov
0ca14ef8b7 hints: use host id to send hints
Drop address translation that no longer needed. Templates here are used
temporarily until another user of the function (MV) is converted as
well.
2024-12-02 10:31:12 +02:00
Gleb Natapov
5b9e4c2f07 storage_proxy: remove id_vector_to_addr since it is no longer used
Was needed during transition period only.
2024-12-02 10:31:12 +02:00
Gleb Natapov
6116751e44 db: consistency_level: change is_sufficient_live_nodes to work on host ids
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
eb3d2307ce replication_strategy: move sanity_check_read_replicas to host id
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
ccbfabb858 db: consistency_level: move filter_for_query to host id
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
474b47ed22 database: move hits rates handling to host ids
Hits rates map is now indexed by ip. Change it to be indexed by host id since this is what
storage proxy uses now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
d2cf5ca030 messaging_service: pass host id to connection_dropped handler id available
RPC clients which are host id aware may pass the id to
connection_dropped callback and save the need for translation.
2024-12-02 10:31:12 +02:00
Gleb Natapov
9f7183286a storage_proxy: change batchlog to work on host ids
It was not translated in the first pass.
2024-12-02 10:31:12 +02:00
Gleb Natapov
a1fdc8c847 storage_proxy: change mutation rpcs to send forward and reply addresses as host ids
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
2024-12-02 10:31:12 +02:00
Gleb Natapov
cd9b349886 migration_manager: move to use host ids instead of ips
Users also amended to pass ids instead of ips.
2024-12-02 10:31:12 +02:00
Gleb Natapov
2f23a21a23 raft: raft_group_registry: do not insert entry into raft address map on incoming message
Raft map is no longer used to send raft messages. We rely on gossiper
address propagation now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
1f302577d0 group0: move transfer_snapshot to use host ids
No need to translate id to ip any longer.
2024-12-02 10:31:12 +02:00
Gleb Natapov
e695cb1054 topology: use gossiper address map instead of raft one in storage service
Also remove forcing of the replacing node to be alive which is not
needed any more since gossiper no longer inhibits replacing nodes from
advertising themselves.
2024-12-02 10:31:12 +02:00
Gleb Natapov
b6425446c6 gossiper: fix indentation after previous patch 2024-12-02 10:31:11 +02:00
Gleb Natapov
a64b079b5c gossiper: drop advertise_myself parameter to gossiper
The parameter was needed when nodes were addressed by IP, so during
replace with the same IP a new node had to "hide" itself from the
cluster to not get accidentally confused with the old node. Now,
when nodes are addressed by host id the situation is impossible.
2024-12-02 10:31:11 +02:00
Gleb Natapov
12937aeb7f storage_proxy: move to addressing nodes by host ids instead of ips
In this rather large path we mode to address nodes in storage proxy by
host ids instead of ips. Some subsystems storage proxy calls to are
not yet converted to host ids, so we translate back and forth when we
interact with them.
2024-12-02 10:31:11 +02:00
Gleb Natapov
b7402af872 locator: topology: add sort_by_proximity function that works on host ids 2024-12-02 10:31:11 +02:00
Gleb Natapov
0882f2024c locator: topology: make topology object always contain local node
Currently the locator::topology object, when created, does not contain
local node, but it is started to be used to access local database. It
sort of work now because there are explicit checks in the code to handle
this special case like in topology::get_location for instance. We do not
want to hack around it and instead rely on an invariant that the local
node is always there. To do that we add local node during
locator::topology creation. There is a catch though. Unlike with IP host
ID is not known during startup. We actually need to read from the
database to know it, so the topology starts with host ID zero and then
it changes once to the real one. This is not a problem though. As long as
the (one node) topology is consistent (_cfg.this_host_id is equal to the
node's id) local access will work.
2024-12-02 10:31:11 +02:00
Gleb Natapov
9cda32af92 locator: put real host id into the replication map for local replication strategy
Local replication strategy returns zero host id in replica set instead
of the real one. It mostly works now because code that translates ids
to ips knows that zero host id is a special one. But we want to use host
ids directly and we need to return real one (or handle zero special case
everywhere).
2024-12-02 10:31:11 +02:00
Gleb Natapov
e7f869591d gossiper: add address map getters 2024-12-02 10:31:11 +02:00
Gleb Natapov
faef04e688 replication_strategy: add host id versions of get_natural_endpoints/get_pending_endpoints/get_endpoints_for_reading functions
Those functions will return host ids instead of ips.
2024-12-02 10:31:11 +02:00
Gleb Natapov
1c5a7826dc storage_service: pass gossip_address_map
It will be used in the following patches.
2024-12-02 10:31:11 +02:00
Gleb Natapov
79358278f2 service: raft: move raft pinger to sending messages by host id
This allows us to drop dependency on raft_address_map from direct_fd_pinger.
2024-12-02 10:31:11 +02:00
Gleb Natapov
0e045181d8 raft_rpc: use host ids to send raft rpcs
Address translation is no longer needed since host id can be used
directly.
2024-12-02 10:31:11 +02:00
Gleb Natapov
2c17fa6370 topology coordinator: drop raft_address_map dependency
raft_address_map is not used by the coordinator code any longer.
2024-12-02 10:31:11 +02:00
Gleb Natapov
aba4ae0ca1 topology coordinator: rename wait_for_ip to wait_for_gossiper and drop raft address map usage
What wait_for_ip is actually does is waiting for a node to appear in the
gossiper since this is when it is added to the raft address map. Drop
the usage of the address map and check the gossiper directly.
2024-12-02 10:31:11 +02:00
Gleb Natapov
414ec6d5bb topology coordinator: get rid of host id to ip translations
Now we have enough functionality in the gossiper and messaging service
to get rid of ip2id function in the topology coordinator. We can use
hos ids directly.
2024-12-02 10:31:11 +02:00
Gleb Natapov
15145c16d1 gossiper: provide wait_alive that works on host ids
We have wait_alive function that gets an array of ip address and wait
for all of them to be alive. Provide similar one that works on host ids.
2024-12-02 10:31:10 +02:00
Gleb Natapov
84c7aa8f48 gossiper: send up notifications by host ids 2024-12-02 10:31:10 +02:00
Gleb Natapov
609cb2dee9 gossiper: send failure detection ping to a host id instead of ip
This way wrong host will not answer it.
2024-12-02 10:31:10 +02:00
Gleb Natapov
c51263d085 messaging_service: add a separate map for clients created with host id available
We want to use different clients to send messages based on ids and ips,
so provide a separate map to hold them.
2024-12-02 10:31:10 +02:00
Gleb Natapov
83cde134d0 messaging_service: add dst host id to CLIEN_ID RPC and send it if provided
If an RPC client creation was triggered by send function that has host
id as a dst send it as part of CLIENT_ID RPC which is always the first
RPC on each connection. If receiver's host id does not match it will
drop the connection.
2024-12-02 10:30:59 +02:00
Andrei Chekun
6c267bbc70 test.py: Make it test/cqlpy python module
Removed all path modification and migrated to python way of importing packages. This is another small step to the one pool cluster for better scheduling and better resource utilization.

Fixes: https://github.com/scylladb/scylladb/issues/21644

Closes scylladb/scylladb#21585
2024-12-01 18:26:17 +02:00
Takuya ASADA
2b3115ac79 scylla-server.service: drop scylla-jmx.service
Since we dropped scylla-jmx at 3cd2a61, Wants=scylla-jmx.service is not
needed anymore.

Also we have issue on nonroot mode installation with this line (#21720),
we need to drop this now.

Fixes #21720

Closes scylladb/scylladb#21721
2024-12-01 14:14:33 +02:00
Gleb Natapov
aa87fecce2 gossiper: add is_alive that works on host_id
The function checks if a node with provided id is alive. If it fails to map id to ip
or there is no state for the ip found the node is considered to be dead.
2024-12-01 12:12:30 +02:00
Gleb Natapov
76aa41dfcf messaging_service: pass gossip_address_map to the mm and introduce send by id functions
The function looks up provided host id in gossip_address_map and throws
unknown_address if the mapping is not available. Otherwise it sends the
message by IP found.
2024-12-01 12:12:30 +02:00
Gleb Natapov
0e264ccba9 gossiper: populate gossip_address_map
Add a non expiring entry into the address map for each host in the
gossiper state and change one to expiring when the state is deleted.
2024-12-01 12:12:30 +02:00
Gleb Natapov
ca2544e57e gossiper: introduce gossip address map
Introduce new address map that will be populated by the gossiper. Create
in during initialization and pass it to the gossiper.
2024-12-01 12:12:29 +02:00
Gleb Natapov
be5caec54e service: make address_map raft independent
We want to start using address map class outside for raft, so lets make
it work on host_id instead of raft::servers_id and move is outside of
raft.
2024-12-01 12:12:29 +02:00
Gleb Natapov
cc1b5aaf51 idl: generate host_id variant of send functions as well
We want to be able to address nodes by host ids. For that lets generate
send functions that gets host_id as a dst parameter.

Changes to raft_rpc are needed because otherwise the compiler cannot
select a correct overload.
2024-12-01 12:12:29 +02:00
Gleb Natapov
3ca8bdea11 topology_coordinator: introduce reload_count in topology state and use it to prevent race
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.

Fixes: scylladb/scylladb#19994
2024-12-01 11:02:57 +02:00
Gleb Natapov
b41bf0da6f storage_service: use conditional_variable::when in co-routines consistently
This function is co-routine optimized.
2024-12-01 10:43:48 +02:00
Kefu Chai
65949ce607 test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726
2024-11-29 17:13:21 +01:00
Kefu Chai
afeff0a792 docs: explain task status retention and one-time query behavior
Task status information from nodetool commands is not retained permanently:

- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation

This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21386
2024-11-29 16:36:27 +01:00
Pavel Emelyanov
f2509d90a5 Merge 'mutation: remove unused "#include"s' from Kefu Chai
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed.

please note, because `mutation/mutation.hh` does not include `seastar/coroutine/maybe_yield.hh` anymore, and quite a few source files were relying on this header to bring in the declaration of `maybe_yield()`, we have to include this header in the places where this symbol is used. the same applies to `seastar/core/when_all.hh`.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21727

* github.com:scylladb/scylladb:
  .github: add "mutation" to CLEANER_DIR
  mutation: remove unused "#include"s
2024-11-29 13:01:53 +03:00
Kefu Chai
efbf6e5526 .github: add "mutation" to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's include
"mutation" subdirectory to CLEANER_DIR, so that this workflow can
identify the regressions in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-29 14:01:44 +08:00
Kefu Chai
f436edfa22 mutation: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-29 14:01:44 +08:00
Botond Dénes
055a36ae55 main: dump diagnostics on SIGQUIT
Dump a diagnostics report on each shard when receiving a SIGQUIT. The
report is logged with a dedicated logger, called diagnostics.
The report has multiple parts:
* seastar memory diagnostics, similar to that printed by the scylla
  memory command (from scylla-gdb.py).
* reader concurrency semaphore diagnostics for each semaphore.

Example report:

    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Dumping seastar memory diagnostics
    Used memory:   3988M
    Free memory:   58M
    Total memory:  4G
    Hard failures: 0

    LSA
      allocated: 4M
      used:      16
      free:      4G

    Cache:
      total: 1M
      used:  642K
      free:  398K

    Memtables:
     total: 3M
     Regular:
      real dirty: 0B
      virt dirty: 0B
     System:
      real dirty: 3M
      virt dirty: 3M

    Replica:
      Read Concurrency Semaphores:
        user: 0/100, 0B/81M, queued: 0
        streaming: 0/10, 0B/81M, queued: 0
        system: 0/10, 0B/81M, queued: 0
        compaction: 0/unlimited, 0B/unlimited
        view update: 0/50, 0B/40M, queued: 0
      Execution Stages:
        apply stage:
             Total: 0
      Tables - Ongoing Operations:
        Pending writes (top 10):
          0 Total (all)
        Pending reads (top 10):
          0 Total (all)
        Pending streams (top 10):
          0 Total (all)

    Small pools:
    objsz spansz usedobj memory unused wst%
        8     4K     858    16K     9K   58
       10     4K       5     8K     8K   99
       12     4K       5     8K     8K   99
       14     4K       0     0B     0B    0
       16     4K      2k    44K    15K   35
       32     4K      4k   136K    16K   11
       32     4K      8k   280K    24K    8
       32     4K      3k    92K     6K    6
       32     4K      4k   140K    21K   14
       48     4K      3k   180K    25K   14
       48     4K      2k   120K    27K   22
       64     4K      2k   156K    18K   11
       64     4K     19k     1M    11K    0
       80     4K      3k   236K    16K    6
       96     4K      6k   572K    49K    8
      112     4K      2k   276K    72K   25
      128     4K     477    80K    20K   25
      160     4K     194    60K    30K   49
      192     4K      1k   232K    39K   16
      224     4K      2k   468K    15K    3
      256     4K     182   100K    55K   54
      320     8K     349   152K    43K   28
      384     8K     332   288K   164K   56
      448     4K     243   180K    74K   40
      512     4K     256   244K   116K   47
      640    16K     185   192K    76K   39
      768    16K     394   432K   137K   31
      896     8K      54   192K   144K   75
     1024     4K     288   432K   144K   33
     1280    32K      92   256K   140K   54
     1536    32K      11   128K   111K   86
     1792    16K      10   144K   126K   87
     2048     8K     487     1M    90K    8
     2560    64K     113   384K   100K   26
     3072    64K       9   256K   228K   89
     3584    32K       3   288K   277K   96
     4096    16K     129   912K   396K   43
     5120   128K      21   384K   275K   71
     6144   128K       4   512K   486K   94
     7168    64K       3   576K   553K   96
     8192    32K     373     3M    56K    1
    10240    64K       6   832K   770K   92
    12288    64K      17   960K   756K   78
    14336   128K       2     1M     1M   97
    16384    64K      14     1M   992K   81

    Page spans:
    index  size  free  used spans
        0    4K    4K    5M    1k
        1    8K    8K    2M   213
        2   16K   16K    2M   106
        3   32K   64K    6M   200
        4   64K   64K    4M    71
        5  128K  384K 3934M   31k
        6  256K    1M  256K     5
        7  512K  512K  512K     2
        8    1M    2M    0B     2
        9    2M    2M    2M     2
       10    4M    4M    0B     1
       11    8M   16M    0B     2
       12   16M   32M    0B     2
       13   32M    0B   32M     1
       14   64M    0B    0B     0
       15  128M    0B    0B     0
       16  256M    0B    0B     0
       17  512M    0B    0B     0
       18    1G    0B    0B     0
       19    2G    0B    0B     0
       20    4G    0B    0B     0
       21    8G    0B    0B     0
       22   16G    0B    0B     0
       23   32G    0B    0B     0
       24   64G    0B    0B     0
       25  128G    0B    0B     0
       26  256G    0B    0B     0
       27  512G    0B    0B     0
       28    1T    0B    0B     0
       29    2T    0B    0B     0
       30    4T    0B    0B     0
       31    8T    0B    0B     0

    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore user with 0/100 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 0
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore streaming with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 6
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 6
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 6
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 6
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore compaction with 0/2147483647 count and 0/9223372036854775807 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 27
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore system with 0/10 count and 0/84850769 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state
    1	0	0B	*.*/view_builder/active

    1	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 234
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 234
    reads_enqueued_for_admission: 154
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 80
    reads_queued_because_ready_list: 154
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 235
    current_permits: 1
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0
    INFO  2024-11-27 01:31:55,882 [shard 0:main] diagnostics - Diagnostics dump requested via SIGQUIT:
    Semaphore view_update with 0/50 count and 0/42425384 memory resources: user request, dumping permit diagnostics:

    permits	count	memory	table/operation/state

    0	0	0B	total

    Stats:
    permit_based_evictions: 0
    time_based_evictions: 0
    inactive_reads: 0
    total_successful_reads: 0
    total_failed_reads: 0
    total_reads_shed_due_to_overload: 0
    total_reads_killed_due_to_kill_limit: 0
    reads_admitted: 0
    reads_enqueued_for_admission: 0
    reads_enqueued_for_memory: 0
    reads_admitted_immediately: 0
    reads_queued_because_ready_list: 0
    reads_queued_because_need_cpu_permits: 0
    reads_queued_because_memory_resources: 0
    reads_queued_because_count_resources: 0
    reads_queued_with_eviction: 0
    total_permits: 0
    current_permits: 0
    need_cpu_permits: 0
    awaits_permits: 0
    disk_reads: 0
    sstables_read: 0

Fixes: scylladb/scylladb#7400

Closes scylladb/scylladb#21692
2024-11-28 18:52:29 +02:00
Botond Dénes
ff90a77f5b scylla-sstable: revamp schema sources
Demote --scylla-data-dir and --scylla-yaml-file to schema source
helpers, rather than schema source in themselves. This practically means
that when these options are used, they won't define where the tool will
attempt to load the schema from, they will just be helpers to help locate
the schema, for whichever schema source the tool was instructed to use
(or left to choose).
--scylla-data-dir and --scylla-yaml-file being schema sources were
problematic with encryption at rest and for S3 support (not yet
implemented). With encryption, the tool needs access to the
configuration, so --scylla-yaml-file is often used to provide the path
to the configuration file, which contains encryption configuration,
needed for the tool to decrypt the sstable. Currently, using this option
implies forcing the tool to read the schema from the schema tables,
which is a problematic option for tests -- Scylla might be compacting a
schema sstable and this will make the tool fail to load the schema.
Demoting these options the schema helpers, allows providing them, while
at the same time having the option to use a different schema-source.

To allow the user to force the tool to load the schema from the schema
tables, a new --schema-tables option is added. Similarly, a
--sstable-schema option is introduced to force the tool to load the
schema from the sstable itself.

With this, each 4 schema source now has an option to force the use of
said schema source. There are various helper options to be used along
with these.

The documentation as well as the tests are updated with the changes.
The schema related documentation gets an rather extensive facelift
because it was a bit out-of-date and incomplete.

Fixes: scylladb/scylladb#20534

Closes scylladb/scylladb#21678
2024-11-28 18:36:09 +02:00
Ferenc Szili
e54c07ba75 test: add test for truncate saving replay positions
This change adds a test for truncate correctly saving commit log replay
positions.
2024-11-28 17:20:50 +01:00
Kefu Chai
2c9c654798 build: cmake: Enforce explicit library linkage visibility
This change improves dependency management by explicitly specifying
library linkage visibility in CMake targets.

Previously, some ScyllaDB targets used `target_link_libraries()`
without `PUBLIC` or `PRIVATE` keywords, which resulted in transitive
library dependencies by default. This unintentionally exposed
non-public dependencies to downstream targets.

Changes:
- Always use explicit `PRIVATE` or `PUBLIC` keywords with
  `target_link_libraries()`
- Tighten build dependency tree
- Enforce a more modular linkage model

See: [CMake documentation on library dependencies](https://cmake.org/cmake/help/latest/command/target_link_libraries.html)

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21686
2024-11-28 18:15:23 +02:00
Ferenc Szili
036d3287b3 database: correctly save replay position for truncate
This commit fixes a problem with way truncate saves commit log
replay positions. On shards without mutations, truncate would save the
replay position into system.truncated with shard number 0 regardless of
the actual shard number that the replay position was saved for.
2024-11-28 16:18:32 +01:00
Piotr Smaron
a49ed7074d Update in-memory ks.metadata.init_tablets after ALTER KS
Once e.g. `ALTER KEYSPACE` is performed, all in-memory objects should be updated accordingly, but this is not entirely true for keyspace metadata object. The reason for that is that keyspace metadata are stored in 2 system tables: `system_schema.keyspaces` and `system_schema.scylla_keyspaces`. Up until now the in-memory keyspace metadata object has been updated only with entries from the first table, and missed updates when entries from the 2nd table changed. These entries were e.g. initial tablets or storage options.
This change fixes this oversight by considering both tables when checking if keyspace metadata need to be updated. From the implementation point of view, the change is simple: we're considering `system_schema.scylla_keyspaces` also in `merge_keyspaces()` and if old and new schemas have any differences, we include that when altering ks.

Fixes #20768

Backport: no need, I don't think the issue is severe, atm it seems like it can only influence the tablets number, which should not bring the cluster down nor result in returning bad data, it can mostly influence the speed of the db.

Closes scylladb/scylladb#20852
2024-11-28 13:46:32 +01:00
Aleksandra Martyniuk
4e2cd8640c test: add test to check tablet repair tasks 2024-11-28 12:15:42 +01:00
Michał Jadwiszczak
66071d8097 service/topology_coordinator: migrate view builder only if all nodes are up
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.

This patch also adds the topology state machine notification when a node
is up.
2024-11-28 12:11:08 +01:00
Nikos Dragazis
6091d5d789 sstables: Fix range of input stream in checksummed file data source
The checksummed file data source uses the chunk size to enforce that the
reads from the underlying file input stream will be aligned at the chunk
boundary. This is necessary so that we can validate the checksum of each
chunk.

However, a mismatch in the numeric types caused a bug where the
underlying file input stream would read a smaller portion of the data
file than expected.

The bug is located in the following lines:

```
auto start = _beg_pos & ~(chunk_size - 1);
auto end = (_end_pos & ~(chunk_size - 1)) + chunk_size;
```

`_beg_pos` and `_end_pos` are `uint64_t`, whereas `chunk_size` is
`uint32_t`. When executing the AND operation, the compiler converts the
right operand from `uint32_t` to `uint64_t`. Since the integer is
unsigned, the four most-significant bytes are filled with zeros, thus
erroneously truncating the corresponding bytes of the position.

Fix the bug by explicitly converting the chunk size to `uint64_t` before
any arithmetic operations. Also, replace the handwritten alignment
implementations with the `align_up()` and `align_down()` helpers.

Finally, restrict the file end position to not exceed the file length.
Since the last chunk can be smaller than the chunk size, it could happen
that the end position exceeds the file length after the round-up. This
is not a bug on its own since `make_file_input_stream()` can accept
lengths that go beyond end-of-file, but still it makes the code more
error prone and should be avoided.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#21665
2024-11-28 12:53:05 +02:00
Aleksandra Martyniuk
ab3858e050 test: topology_tasks: enable tablets
Tablets are no longer an experimental feature, but topology_tasks test
suite treats them as if they were.

Enable tablets with their own config option in topology_tasks suite.
2024-11-28 11:42:40 +01:00
Aleksandra Martyniuk
4829dd9de8 service: keep tablets module in storage_service 2024-11-28 11:42:40 +01:00
Aleksandra Martyniuk
6105e6f85c service: rename storage_service::_task_manager_module
Rename storage_service::_task_manager_module to _node_ops_module.
In the following patches, storage service will keep two different
task manager modules.
2024-11-28 11:42:40 +01:00
Aleksandra Martyniuk
409ed508cc service: add tablet_virtual_task
Add tablet_virtual_task, which covers tablet repair.
2024-11-28 11:42:38 +01:00
Dawid Mędrek
7cce9a8f64 db/hints: Prevent dereferencing a null pointer
Before these changes, we dereferenced `app_state` in
`manager::endpoint_downtime_not_bigger_than()` before checking that it's
not a null pointer. We fix that.

Fixes scylladb/scylladb#21699

Closes scylladb/scylladb#21676
2024-11-28 11:31:57 +01:00
Aleksandra Martyniuk
898c8f4e24 tasks: utilize preliminary virtual task lookup
When API user requests status of a virtual task, we first need to find
which virtual_task instance tracks given operation. While doing this we
gather some info regarding the task, but we don't utilize it.

Add virtual_task_hint that keeps info that was gathered during virtual
task lookup and pass it to virtual_task's methods so the info doesn't
need to be retrieved twice.
2024-11-28 11:27:16 +01:00
Ernest Zaslavsky
4035e0877d s3_tests: Add s3 test to check object re-uploading
Add s3 test to check existing object re-uploading succeeds

Closes scylladb/scylladb#21544
2024-11-28 12:46:59 +03:00
Pavel Emelyanov
58a2c6a7c3 Merge 'replica,sstables: track download progress of download_task_impl' from Kefu Chai
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.

The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
   mapped destinations

While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.

Fixes scylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

---

this a part of experimental feature, hence no need to backport.

Closes scylladb/scylladb#21562

* github.com:scylladb/scylladb:
  test/object_store: Enable tablets to match production settings
  sstables_loader: Track download progress of download_task_impl
  sstables_loader: improve batch tracking using ranges library
  sstables_loader: print streaming progress with moving range
  sstables_loader: mark sstable_streamer::stream_sstable_mutations() private
  sstables_loader: fix indentation in stream_sstable_mutations()
2024-11-28 12:46:31 +03:00
Laszlo Ersek
5f8549a9a0 configure.py: honor "--build-dir" when using CMake
The "--use-cmake" option currently hardwires the build directory as
"$source_dir/build". Adhere to the "--build-dir" option's argument
instead:

- If the option is not specified, its argument defaults to "build"; thus,
  there is no change in behavior.

- If the option specifies a relative pathname, append it to $source_dir.

- If the option specifies an absolute pathname, use it as-is.

This is especially useful for keeping the build directory on a filesystem
separate from the source directory (without resorting to creating "build"
as a symlink, before running "configure.py"). For example, the source tree
can be accessed remotely over sshfs, from a build host, while keeping the
build artifacts (and hence the link stage) local to the build host.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#21694
2024-11-28 11:26:26 +03:00
Avi Kivity
7e02f9bbaa tombstone_gc.hh: remove include of boost/icl/interval_map.hh
tombstone_gc.hh is relatively lightweight and is used in many places,
but it includes the heavyweight boost/icl/interval_map.hh. Lighten
the load for its users by wrapping lw_shared_ptr<some icl map type>
in a forward-declared class. Define the class in a new header
tombstone_gc-internals.hh, to be used by the two translation units
that need it.

Ref #1.

Closes scylladb/scylladb#21706
2024-11-28 11:24:51 +03:00
Kefu Chai
23a7e9a6d0 docs: align tablestats documentation with actual output
Update the tablestats documentation to correctly describe the "Number of
partitions" metric. The previous documentation incorrectly referred to
"estimated row count" when the command actually shows estimated partition count.

Before:

```
Number of keys (estimate) | The estimated row count
```

After:

```
Number of partitions (estimate) | The estimated partition count
```

This distinction is important since a partition (identified by its partition
key) can contain multiple rows in ScyllaDB. The updated format also matches
Cassandra's nodetool output for better compatibility.

Fixes scylladb/scylladb#21586

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21598
2024-11-28 09:36:21 +02:00
Lakshmi Narayanan Sreethar
91148e7747 compaction: remove unused update_sstable_lists_on_off_strategy_completion
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
36195d29c6 compaction_group: replace update_sstable_lists_on_off_strategy_completion
Now that `update_sstable_sets_on_compaction_completion` can update both
the main and maintenance sets, callers of
`update_sstable_lists_on_off_strategy_completion` can replace it with
the former.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
4f7b1c0bcd compaction_group: rename update_main_sstable_list_on_compaction_completion
Rename `update_main_sstable_list_on_compaction_completion` to
`update_sstable_sets_on_compaction_completion` as the method updates
both main and maintenance sstable sets now.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
5b4f6b7871 compaction_group: update maintenance sstable set on scrub compaction completion
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there.

This patch modifies the `update_sstable_sets_on_compaction_completion`
to remove the input sstable from the maintenance sstable set if it
exists in that set.

Also added a testcase to verify the fix.

Fixes #20030

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
e525c584a7 compaction_group: store table::sstable_list_builder::result in replacement_desc
Directly store the result of `build_new_list` in `replacement_desc`
instead of storing just the newly built sstable_set. Adjust the
`backlog_tracker_adjust_charges` to use the removed sstables list
returned by the `build_new_list`, so that when the next patch updates
the `update_main_sstable_list_on_compaction_completion` to also update
the maintenance sstable set, only sstables removed from main sstable set
will be removed from the backlog tracker.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
87a4fae3e7 table::sstable_list_builder: remove old sstables only from current list
The `build_new_list()` method previously joined the current and new
sstable ranges, removing old sstables from the combined result. This
patch updates the method to treat them separately, ensuring old sstables
are removed only from the current sstable list.

This change enables the method to return the correct set of removed
sstables in cases where an sstable is directly moved from the
maintenance set to the main set.
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
0e08ccd307 table::sstable_list_builder: return removed sstables from build_new_list
Updated the method table::sstable_list_builder::build_new_list() to
return the list of sstables that was removed along with the newly built
sstable set. This change will be used to unify the
`update_sstable_lists` variants in a following patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Kefu Chai
79cc90141b test/object_store: Enable tablets to match production settings
Enable the `enable_tablets` configuration flag in object store tests to better
align with production environments, where it is enabled by default via the
`scylla.yaml` in Scylla's relocatable tarball. This change will improve test
coverage of tablet-related features.

Previously, `enable_tablets` defaulted to false in tests, creating a mismatch
with typical production deployments.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
5ab4932f34 sstables_loader: Track download progress of download_task_impl
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.

The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
   mapped destinations

While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.

Fixes scylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
e57f674066 sstables_loader: improve batch tracking using ranges library
Replace manual vector iteration with ranges library to preserve batch size
information. When streaming SSTable mutations, we need to track progress
across batches. The previous implementation used a loop to move elements
from the vector's end, but this approach lost the batch size information
since the SSTable set was moved away during streaming.

Now use std::ranges to take elements from the vector's end instead of
manual iteration. This preserves the original batch size, enabling accurate
progress tracking which will be implemented in a follow-up commit.

Technical changes:
- Replace manual vector iteration with ranges::take_view
- Preserve batch size information for progress tracking
- Maintain existing batch processing behavior

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
38af123e05 sstables_loader: print streaming progress with moving range
After d1db17d490, the processed SSTable counter remained at 0 while
streaming progress was still being displayed. This fix properly tracks
and displays streaming progress by:

- Moving SSTable counter (`nr_sst_current`) to `sstable_streamer::stream_sstables()`
- Generating UUID at the streaming initialization
- Relocating progress reporting to `stream_sstables()` for accurate tracking

This ensures the progress indicator correctly reflects the actual number
of processed SSTables during streaming operations.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
83f55fdb84 sstables_loader: mark sstable_streamer::stream_sstable_mutations() private
the only user of `sstable_streamer::stream_sstable_mutations()` is
`sstable_st6reamer::stream_sstables()`, so mark this member function as
private.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
5a39be9b8c sstables_loader: fix indentation in stream_sstable_mutations()
Fix indentation regression from d1db17d490 where the function body of
`sstable_streamer::stream_sstable_mutations()` was left incorrectly
indented after the function was extracted to decouple streaming from
sstable selection.

Pure style fix, no functional changes.

in this change, we correct the indent.

Refs d1db17d490
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-28 10:00:45 +08:00
Kefu Chai
5e391eee25 treewide: use coroutine::parallel_for_each(range) when appropriate
`coroutine::parallel_for_each` accepts both a range and a pair of
iterators. let's use the former when appropriate. it is simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21684
2024-11-27 21:00:47 +02:00
Botond Dénes
20bbb1113e test/cqlpy: test_tools.py: use xfail more selectively
ScyllaDB doesn't support counters with tablets yet. So scylla-sstable
tests which use counter schema are marked with xfail, but this is done
too aggressively, disabling too many tests that are otherwise fine.
There are two tests affected:
* test_scylla_sstable_script - this test uses early return when the
  schema parameter is the one with counters and tablets are enabled.
  This is still too eager because tablets are now always enabled. Also,
  the early return make the fact that this test is disabled hidden.
  So change the check to check whether tablets are used on the test
  keyspace and use xfail instead of sneaky early return.
* test_scylla_sstable_dump_data - this test is blanket-disabled when run
  with the tablets parameter. Even though only 1 out of 5 schemas tested
  use counters. Remove the blanket xfail and only add it when test
  keyspace uses tablets and the schema parameter is the one with
  counters.

This makes dozens of test run again, restoring the test coverage lost
with the too eager use of xfail (and sneaky return).

Refs: #18180

Closes scylladb/scylladb#21685
2024-11-27 12:17:56 +03:00
Kefu Chai
8ca1c57de0 test: s3_proxy: bring back InjectingHandler.log_message
in 0dff187b7a, we dropped `InjectingHandler.log_message()`, but this
method was defined to override the default implementation provided by
`BaseHTTPRequestHandler.log_message()`. this change flooded the standard
output when testing `aws_error_injection_test` with `test.py` with
logging messages like:
```
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=0&Key=%2Ftest%2Ftestobject-large-817295 HTTP/1.1" 200
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=1&Key=%2Ftest%2Ftestobject-large-817306 HTTP/1.1" 200
```

this is unexpected.

in this change, we bring this method back, and additionally, we
format the logging message lazily.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21689
2024-11-27 12:16:36 +03:00
Botond Dénes
87bdfb80aa docs/dev/reader-concurrency-semaphore.md: fix formatting of diagnostics dump
Indent the whole thing so it is formatted as code, not as text.

Closes scylladb/scylladb#21693
2024-11-27 12:13:16 +03:00
Botond Dénes
ccb433d767 Merge 'tasks: add api_task_ttl for tasks started with API' from Aleksandra Martyniuk
When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long.

Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed.

Fixes: #21499.
Refs: #21425.

No backport needed - previous versions may use task_ttl

Closes scylladb/scylladb#21505

* github.com:scylladb/scylladb:
  test: add test to check user_task_ttl
  tasks: api: move make_task method
  docs: nodetool: update backup and restore commands docs
  docs: update task manager docs
  nodetool: add nodetool tasks user-ttl command
  node_ops: use user task ttl for node ops virtual task
  tasks: use user_task_ttl for tasks started by user
  api: task_manager: add /task_manager/user_ttl to get and set user task ttl
  tasks: add task_manager::task::is_user_task method
  tasks: keep updateable_value of task_ttl in task manager
  db: config: add user_task_ttl_seconds named value
2024-11-27 09:57:57 +02:00
Nikita Kurashkin
4ba8a6b1b4 Fix test for DESC TABLE on materialised view to be compatible with Scylla AND Cassandra
Fixes #21026
Refs #21500

Closes scylladb/scylladb#21526
2024-11-27 09:49:23 +02:00
Pavel Emelyanov
4d10cd40f0 s3: Remove unused boost/algorithm/string/classification.hpp inclusion
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21690
2024-11-26 19:51:21 +02:00
Botond Dénes
98fa00499c Update seastar submodule
* seastar a5432364...72c7ac57 (13):
  > json2code: convert boost integer_range to std iota_view
  > utils/program-options: selection_value: make get_candidate_names() public
  > treewide: Trim trailing spaces
  > build: Clean up `compile_option` after FindSanitizers module
  > file: Convert file operations to use coroutines
  > websocket: Remove unnecessary condition in frame parsing
  > websocket: Fix logic when parsing header
  > websocket: Avoid memory copy when full websocket frames are received
  > websocket: Fix websocket frame parsing on partial packets
  > sharded.hh: remove inline from templates
  > reactor: fix reserve_io_control_blocks config name in error message
  > reactor: next_waitpid_timeout: contants can be defined as constexpr
  > reactor: coroutinize waitpid

Closes scylladb/scylladb#21688
2024-11-26 19:50:45 +02:00
Kamil Braun
1f5b83dc56 Merge 'docs: update admin-tools docs with deprecation and removal notice for java tools' from Botond Dénes
Java tools are deprecated and slated for removal in the next ScyllaDB release.
Update the admin-tools docs and make sure all java tool documentation pages have a notice reflecting this fact.

Fixes: https://github.com/scylladb/scylladb/issues/21149

Should be backported to 6.2, so users of the latest stable version can see the notice.

Closes scylladb/scylladb#21522

* github.com:scylladb/scylladb:
  docs: sstableloader.rst: add deprecation notice
  docs: admin-tools: update deprecation notice for sstable{dump,metadata}
  docs: tools_index.rst: remove deprecated sstablereset and sstablerepairedset tools
2024-11-26 17:03:56 +01:00
Ernest Zaslavsky
793f2c95d1 snapshots: Stop taking snapshots of MVs
Stop taking snapshots of MVs and allow taking snapshot of individual tables, now one can take a snapshot of any base table, any view or index. Also add tests to cover new cases both boost test (using cc code) and pytest (using the API)
Also, update documentation to reflect the change

fixes: #21339
fixes: #20760

Closes scylladb/scylladb#21433
2024-11-26 15:27:30 +02:00
Kamil Braun
9dc8926252 Merge 'a bunch of cleanups and enhancements to various services' from Gleb
Mostly no functional changes here except in patch 3.

* 'gleb/cleanups' of github.com:scylladb/scylla-dev:
  migration_manager: move migration manager verbs to the IDL
  storage_proxy: remove unused function
  storage_proxy: co-routinize handle_paxos_prepare
  storage_proxy: co-routinise handle_paxos_prune
  service: raft: no need to sync schema if the cluster is in raft topology mode
  messaging_service: co-routinize messaging_service::stop_client
  gossiper: rename apply_state_locally_without_listener_notification to apply_state_locally_in_shadow_round
2024-11-26 14:18:00 +01:00
Kefu Chai
a5ee0c896b treewide: migrate from boost::adaptors::filtered to std::views::filter
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.

Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views

Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices

Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
   - `std::ranges::to` adaptor requires rvalue references
   - Necessitated updates to variable and parameter constness
   - Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
     from `common` to enable efficient range conversion

2. Range Iteration and Mutation
   - Range views may mutate internal state during iteration
   - Cannot pass ranges by const reference in some scenarios
   - Solution: Pass ranges by rvalue reference to explicitly indicate
     state invalidation

Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
  due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change

This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21648
2024-11-26 14:26:50 +02:00
Aleksandra Martyniuk
ac6a07117a test: add test to check user_task_ttl 2024-11-26 09:57:42 +01:00
Aleksandra Martyniuk
1712c93261 tasks: api: move make_task method
task_manager::module::make_task method template is used only for
test_task_impl. Move it to api/task_manager_test.cc and modify it
to be test_task_impl-specific.
2024-11-26 09:57:42 +01:00
Aleksandra Martyniuk
1244982071 docs: nodetool: update backup and restore commands docs 2024-11-26 09:57:41 +01:00
Aleksandra Martyniuk
3b86150e88 docs: update task manager docs 2024-11-26 09:57:41 +01:00
Aleksandra Martyniuk
1ade668d79 nodetool: add nodetool tasks user-ttl command 2024-11-26 09:57:23 +01:00
Evgeniy Naydanov
1e9d780e89 test.py: deselect random failures which can cause #21534
Following combinations of error injections and cluster events
can cause #21534.  Disable them for now because they break CI.

Closes scylladb/scylladb#21658
2024-11-26 10:38:15 +02:00
Nikos Dragazis
29ce29db33 tools/scylla-sstable: Rename valid_checksums -> valid
The `sstable validate-checksums` tool provides the validation result
via the `valid_checksums` key in its JSON response.

The name can be misleading as it refers to both the per-chunk checksums
and the digest (full checksum). We use the terms "digest" and
"full checksum" interchangeably.

Replace with the word "valid" to avoid confusion.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 16:04:58 +02:00
Nikos Dragazis
28f2aefc7b test: Check validate_checksums() with missing digest
The previous patch extended `validate_checksums()` to perform checksum
validation even if the digest component is missing. Add a test case for
this scenario.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 16:04:57 +02:00
Nikos Dragazis
636524bde1 sstables: Allow validate_checksums() to report missing digests
Currently, `validate_checksums()` expects the SSTable to have a digest
component and fails immediately otherwise. This is suboptimal since data
integrity verification could still be carried out partially via checksum
checking.

Lift this restriction by allowing the function to perform checksum
checking in any case, and treat digest checking as best effort. Add a
separate boolean flag in the response to indicate the presence or
absence of the digest component, so that the user can deduce if a valid
result involved digest checking or not.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 16:04:57 +02:00
Nikos Dragazis
4c18f90f95 sstables: Refactor validate_checksums() to use checksummed data stream
`validate_checksums()` is used to check the checksums and digests of an
SSTable. Currently, the procedure is ad-hoc: the helper functions
`do_validate_[un]compressed()` loop over a raw stream, calculate the
actual checksums and digest, and compare against the expected ones.

In an effort to reduce code duplication, remove the custom procedure for
uncompressed SSTables and use a checksummed input stream instead. The
checksummed input stream offers the same functionality of checksum and
digest checking transparently. Also, check if the SSTable has checksums
before creating the input stream because `data_stream()` would return a
raw stream in this case.

Although the compressed input stream offers the same checksum and digest
checks, we need to stick with the existing procedure for compressed
SSTables. The reason is that `validate_checksums()` needs to examine the
whole data file, so any failed checksum checks must be tolerated. With
checksummed streams we support that via a user-provided graceful error
handler that just logs a message and updates the validation status.
However, with compressed streams we cannot customize the error handling
logic because they return decompressed data, but decompression may fail
if applied on corrupted data.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 16:04:53 +02:00
Nikos Dragazis
a6689ebd2e sstables: Add error_handler parameter to data_stream()
Expose the `error_handler` parameter from the checksummed input stream.
This is a callback function that the input stream calls if an invalid
checksum or digest is encountered.

The parameter is ignored if integrity checking is disabled. It is also
ignored in case of compressed SSTables, since the compressed input
streams do not support it.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 15:55:49 +02:00
Nikos Dragazis
77285f24c8 sstables: Add error handler in checksummed data source
Currently, the checksummed data source treats an invalid checksum or
digest as an unrecoverable error by throwing a
`malformed_sstable_exception`. This does not allow to use this data
source in places where it is required to resume after a failed checksum
(e.g., in `validate_checksums()`).

Make the error handling logic customizable via a callback function.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 15:55:49 +02:00
Nikos Dragazis
6e0be02165 sstables: Check for excessive chunks in checksummed data source
For uncompressed SSTables, the expected number of chunks is the number
of checksums in the CRC component. The data file must contain the same
number of chunks. Otherwise, the SSTable should be considered as
corrupted.

Add a check in the checksummed data source to ensure that the data file
does not contain more chunks than expected. Run this check every time
the caller reads more data from the stream. The check will be triggered
when they attempt to read past the expected number of chunks and more
chunks are indeed available. This behavior is consistent with the
compressed data source and allows for partial reads to succeed.

This check will not be triggered if an SSTable has been corrupted by
appending new data, but the new data do not overflow the last chunk.
Since the SSTable metadata only record the expected number of chunks, we
cannot know the exact expected file size at a byte-level. However, this
kind of corruption will be detected by the checksum check, and by the
digest check if enabled.

In fact, the checksum check would suffice for all kinds of corruption
due to appended data except for one case: when the pre-corruption data
file was aligned at the chunk boundary, i.e., the last chunk was full.
This patch closes this gap.

Finally, note that, as a side-effect, this patch fixes a bug where we
would do an out-of-bounds read on the checksum array.

This patch is part of incorporating the functionality of
`do_validate_uncompressed()` into the checksummed data source.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 15:55:49 +02:00
Nikos Dragazis
5c2b1fa125 sstables: Check for premature EOF in checksummed data source
Uncompressed SSTables may have a last chunk that is smaller than the
chunk size. The condition for premature EOF is when more chunks are
expected when such a chunk is encountered. The expected number of chunks
is the number of checksums in the CRC component.

A premature EOF can happen if the data file has been truncated.
An edge case is when the truncation happened at exactly the chunk
boundary and before the SSTable was loaded. In this case, this check
will not be triggered because the early return statement of `get()`
will evaluate as true (`_pos` will match the `_end_pos`, which is the
actual file size). But it will be caught by the digest check.

This patch is part of incorporating the functionality of
`do_validate_uncompressed()` into the checksummed data source.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 15:55:45 +02:00
Aleksandra Martyniuk
e703ba08f8 node_ops: use user task ttl for node ops virtual task
Use user task ttl for node ops virtual task. Modify the test accordingly.
2024-11-25 14:21:53 +01:00
Aleksandra Martyniuk
6241d49b64 tasks: use user_task_ttl for tasks started by user 2024-11-25 14:21:53 +01:00
Aleksandra Martyniuk
19a90e3697 api: task_manager: add /task_manager/user_ttl to get and set user task ttl 2024-11-25 14:21:53 +01:00
Aleksandra Martyniuk
292d00463a tasks: add task_manager::task::is_user_task method 2024-11-25 14:21:53 +01:00
Aleksandra Martyniuk
16e204dfdb tasks: keep updateable_value of task_ttl in task manager
Drop task_ttl observer from task manager and use updateable_value.
2024-11-25 14:20:43 +01:00
Aleksandra Martyniuk
1bf073704c db: config: add user_task_ttl_seconds named value
Add user_task_ttl_seconds config option and keep the value in task manager.
In the following patches tasks started by user will be kept in task
manager for user_task_ttl_seconds after they are finished.
2024-11-25 14:16:06 +01:00
Nadav Har'El
cb6c55209a Merge 'locator: token_metadata: replace boost range with std range' from Avi Kivity
Reduce dependency load by standardizing on std::ranges. This is a little
involved since a we use a custom iterator.

Code cleanup; no backport.

Closes scylladb/scylladb#21421

* github.com:scylladb/scylladb:
  locator: token_metadata: switch from boost ranges to std ranges
  locator: token_metadata: make iterator support std::input_iterator concept
  locator: tokens_metadata: move tokens_iterator to namespace scope
2024-11-25 14:58:45 +02:00
Nikos Dragazis
320c9d17d4 test: test_validate_checksums: Check SSTable with invalid digest
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 14:37:00 +02:00
Nikos Dragazis
08b8dfe9c7 test: test_validate_checksums: Check SSTable with appended data
An SSTable can be corrupted by appending random data to it.
`validate_checksums()` should be able to identify such SSTables as
invalid.

Cover this with a test case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 14:36:59 +02:00
Nikos Dragazis
c569cdf534 test: test_validate_checksums: Complement test for truncated SSTable
The tests for `validate_checksums()` already cover the case of a
truncated SSTable. However, the test performs the truncation after the
SSTable has been loaded, which means that the SSTable object has cached
the old file size by the time we validate its checksums. This is a valid
case, but not the most common one.

Add a new test that loads the SSTable after the truncation. Do not use
the same SSTable as for the other tests, since this has been loaded
already.

Additionally, let both tests check SSTables with different types of
truncations: minor truncations affecting only the last chunk,  and major
truncations spanning across multiple chunks.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-25 14:36:59 +02:00
Botond Dénes
510e09c648 docs/ddl: document memtable_flush_period_in_ms
This option was implemented by scylladb/scylladb#20999 but it wasn't
documented. Add a description of this option to the create table page.
Note that the option was accepted already before
scylladb/scylladb#20999, but it's value was ignored.

Fixes: scylladb/scylladb#21671

Closes scylladb/scylladb#21673
2024-11-25 13:53:21 +02:00
Nadav Har'El
61e8975930 Merge 'test/boost/view_schema_test: Improve test_view_update_generating_writetime' from Dawid Mędrek
In this PR, we improve various aspects of the test:

* increase obtained information whenever any test case fails,
* split test cases,
* elaborate on the semantics of generating view updates
  and what exactly we check and why.

Backport: not needed, this is an enhancement.

Closes scylladb/scylladb#21579

* github.com:scylladb/scylladb:
  test/boost/view_schema_test: Improve comments in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Improve checks in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Split test cases in test_view_update_generating_writetime
2024-11-25 13:46:56 +02:00
Botond Dénes
090ab796dd Merge 'repair: Enable small table optimization for RBNO bootstrap and decommission' from Asias He
The non local strategy system keyspaces usually contain very litte data.
All the tables within them have to be repaired for all the token ranges,
which could be large in clusters with a large number of nodes. In multiple
DC setup, the repair in RBNO is dominated by the network latency. As a
result, it takes a long time to repair those tables even if they are
almost empty.

To speed up the RBNO bootstrap, especially for starting empty clusters,
this patch enables small table optimization for RBNO for system tables.
We could enable it for small user tables as a follow up.

Tests:

1) A 5ms latency is added to simulate cross dc network delay, 256 tokens
per node, 10 nodes:

- Before

  topology_custom dev topology_custom.test_boot_time.1 1287.06s

- After

  topology_custom dev topology_custom.test_boot_time.1 12.48s

The test shows 100X boot time improvement

2) A SCT test to bootstrap 3 DCs, 3 nodes in each DC.

- Before

  Time to bootstrap = 1h23m

- After

  Time to bootstrap = 13m

The test shows 6X bootstrap time improvement

Fixes #19131

New feature.  No backport is needed.

Closes scylladb/scylladb#21207

* github.com:scylladb/scylladb:
  repair: Enable small table optimization for RBNO bootstrap and decommission
  repair: Move flush_rows after repair_meta class
2024-11-25 11:50:55 +02:00
Nadav Har'El
71c671eeaa docs: copy-edit docs/alternator/compatibility.md
I reread the "ScyllaDB Alternator for DynamoDB users" document
(alternator/compatibility.md) and improved various places that I thought
needed improvement.

Two of the more significant changes is moving the not-really-important
"Scan ordering" section much lower in the document and explaining it
better, and improving the "provisioning" section to focus on the available
and missing functionality, and not on minor API details.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21605
2024-11-25 10:02:36 +03:00
Pavel Emelyanov
08f1d6362b Merge 'readers: migrate to std::ranges' from Botond Dénes
With C++23, we have ranges available. Migrate the code in `readers/` to use ranges, to reduce our dependency on boost (and modernize the code a bit).

Improvement, no backport required

Closes scylladb/scylladb#21643

* github.com:scylladb/scylladb:
  readers/mutation_reader: migrate to std::ranges
  readers/multishard: migrate to std::ranges::{push,pop}_heap()
  readers/combined: migrate to std::ranges::subrange<>
  readers/combined: migrate to std::ranges::{push,pop}_heap()
2024-11-25 10:01:13 +03:00
Evgeniy Naydanov
5d254b1fdf test.py: topology_random_failures: increase timeout for Scylla startup
We run topology_random_failures in debug mode only and sometimes Scylla
is too slow in this mode.  Increase timeout for Scylla startup from
30s to 180s to reduce flakiness.

Fixes #21101

Closes scylladb/scylladb#21659
2024-11-25 09:58:46 +03:00
Kefu Chai
7bc0b64f0a replica: correct indentation after coroutinizing make_sstables_available
The previous commit (b3ebbf35e2) transformed `make_sstables_available()`
into a coroutine but left behind incorrectly indented statements
from a nested lambda. This commit restores proper indentation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21660
2024-11-25 09:58:01 +03:00
Asias He
9ace191616 repair: Enable small table optimization for RBNO bootstrap and decommission
The non local strategy system keyspaces usually contain very litte data.
All the tables within them have to be repaired for all the token ranges,
which could be large in clusters with a large number of nodes. In multiple
DC setup, the repair in RBNO is dominated by the network latency. As a
result, it takes a long time to repair those tables even if they are
almost empty.

To speed up the RBNO bootstrap, especially for starting empty clusters,
this patch enables small table optimization for RBNO for system tables.
We could enable it for small user tables as a follow up.

Tests:

1) A 5ms latency is added to simulate cross dc network delay, 256 tokens
per node, 10 nodes:

- Before

  topology_custom dev topology_custom.test_boot_time.1 1287.06s

- After

  topology_custom dev topology_custom.test_boot_time.1 12.48s

The test shows 100X boot time improvement

2) A SCT test to bootstrap 3 DCs, 3 nodes in each DC.

- Before

  Time to bootstrap = 1h23m

- After

  Time to bootstrap = 13m

The test shows 6X bootstrap time improvement

Fixes #19131
2024-11-25 13:46:17 +08:00
Asias He
69531ed8a6 repair: Move flush_rows after repair_meta class
In the next patch, it will call a member function of repair_meta class.

Refs #19131
2024-11-25 13:46:17 +08:00
Dawid Mędrek
926eaf8fe9 test/boost/view_schema_test: Improve comments in test_view_update_generating_writetime
In this commit, we elaborate on the semantics of generating view updates
for each case the test goes through so that the reader less familiar
with the logic has an easier time understanding it.
2024-11-24 22:48:15 +01:00
Dawid Mędrek
2d12acd09a test/boost/view_schema_test.cc: Improve checks in test_view_update_generating_writetime
We modify the checks in the test to obtain full information whenever
a failure happens. Before this change, we compared the number of view
updates one-by-one. As a result, when the first check failed, we didn't
learn anything about the other two. Now we always compare them all at
once.

A negative impact of this commit is that if one of the lambdas throws
an exception, we don't learn ANYTHING. However, a lambda throwing
an exception is a more appalling problem than the comparison failing,
and we DO learn about it in such a situation; so we accept that cost.
2024-11-24 22:48:13 +01:00
Dawid Mędrek
fb62fc6061 test/boost/view_schema_test.cc: Split test cases in test_view_update_generating_writetime
We split some of the test cases so it's clearer what's going on in the
test. Also, if a bug happens in the future, it should be easier to
reason about it when it corresponds to exactly one CQL statement
instead of possibly two.
2024-11-24 22:47:27 +01:00
Andrei Chekun
8bf62a086f test.py: Create central conftest.
Central conftest allows to reduce code duplication and execute all tests
with one pytest command

Closes scylladb/scylladb#21454
2024-11-24 20:09:48 +02:00
Nadav Har'El
7014aec452 Merge 'Alternator measuring RCU and WCU' from Amnon Heiman
Read and Write Consumed Capacity units are an abstract way of measuring Alternator actions. In general, they correspond to the read or write data.

In the long run, the RCU/WCU adds a way of charging an operation and limiting usage.

This series addresses two issues: consume capacity request API and metering.

The Alternator (and DynmoDB) API has an optional parameter allowing users to check the number of units an operation consumes. When a user adds that parameter, the response will contain the number of units used for the operation.

This series adds the consume capacity support to the get_item and put_item, adds a metric to collect the overall RCU and WCU used, and adds a test for the new functionality.

Follow-up PRs will add support for more operations and GSI.

Replaces #19811
Partially implement: #5027

Closes scylladb/scylladb#21543

* github.com:scylladb/scylladb:
  alternator/test_metrics: Add tests for table consumption units
  test_returnconsumedcapacity.py: Add putItem tests
  Alternator: add WCU support
  Add test/alternator/test_returnconsumedcapacity.py
  alternator/executor: Add consume capacity for get_item
  alsternator/stats: Add rcu and wcu metrics to stats
  alternator/executor.hh: white-space cleanup
  Add the consume_capacity helper class
2024-11-24 19:27:03 +02:00
Dawid Mędrek
f913ae571f db/view: Don't generate view updates for unselected columns
The semantics of Scylla's materialized views may vary depending on how their
primary keys correspond to the base table's one. One of the differences is
how we handle writes to columns in the base table that are not selected by
a view:

* Case 1: The view's PK is a permutation of the base table's PK:

  Since the view's primary key cannot be changed in an update, a row in
  the view remains alive as long as the corresponding row in the base table
  is alive.

  The tricky part comes when the base table has columns that are NOT selected
  by the view. CQL3 used to not allow for defining a table that didn't have
  any other columns besides its primary key. Also, when inserting a row into
  a table, it was mandatory to provide at least one value aside from the
  primary key. At some point it changed [1] and the implementation of the
  solution relied on the notion of the row marker.

  Putting the details aside, consider the following scenario:

  (i)   the base table has a primary key consisting of columns
        c_1, ..., c_k, and it has regular columns rc_1, ..., rc_n,
  (ii)  the primary key of an MV defined on that table consists of
        a permutation of c_1, ..., c_k. The MV doesn't select at least
        one of the regular columns of the base table. Without loss of
        generality, let that unselected column be rc_1.
  (iii) the base table has a row R whose only non-null value is the one
        in the regular column rc_1.

  Now, what will R correspond to in the MV? The base table doesn't have a row
  marker, but all of its regular columns in the MV will be NULLs. That's NOT
  allowed.

  To solve that problem, all unselected columns have corresponding virtual
  columns in the MV; the only information they provide is whether there is
  a value in the base table or not. This way, the MV knows if a row is still
  alive or not.

  For that reason, we send view updates to virtual columns in the following
  cases:

  (i)  the value in the column changes from NULL to a value, i.e. it's
       created,
  (ii) the value in the column exists, but its TTL has been updated.

* Case 2: The view's PK has one more column that the base table's one:

  Since the primary key of the view has a regular column C from the base
  table, it is guaranteed that if there's a row in the MV, the corresponding
  row in the base table can remain alive: since C is part of the view's PK,
  it must have a value, so the row in the base table has a value in C too.
  The problem with virtual columns from the previous case doesn't manifest
  in this one. The liveness of the cell in C determines the liveness of
  the whole row in the view.

The semantics gets more complex, but the conclusion is this: in case 1,
virtual columns exist and we may need to generate view updates for them,
while in case 2 virtual columns do NOT exist and so we don't generate
view updates for them.

What changes in this patch is we adjust the code to it. If a view has
a regular column from the base table as part of its primary key, we
no longer emit view updates when we change a column unselected by that
view. It is purely an OPTIMIZATION change.

[1]: https://issues.apache.org/jira/browse/CASSANDRA-4361

Fixes scylladb/scylladb#21652

Closes scylladb/scylladb#21653
2024-11-24 19:01:28 +02:00
Avi Kivity
29497f8c5d Merge 'Automatically compute schema version of system tables' from Tomasz Grabiec
Schema of system tables is defined statically and table_schema_version needs to be explicitly set in code like this:

```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```

Whenever schema is changed, the schema version needs to change, otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.

It's not obvious that one needs to do that and developers often forget to do that. There were several instances of mistakes of omission, some caught during review, some not, e.g.: 31ea74b96e.

This patch changes definitions to call the new `schema_builder::with_hash_version()`, which will make the schema builder compute version from schema definition so that changes of the schema will automatically change the version. This way we no longer rely on the developer to remember to bump the version offset.

All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.

Closes scylladb/scylladb#21602

* github.com:scylladb/scylladb:
  system_tables: Compute schema version automatically
  schema_builder: Introduce with_hash_version()
  schema: Store raw_view_info in schema::raw_schema
  schema: Remove dead comment
  hashing: Add hasher for unordered_map
  hashing: Add hasher for unique_ptr
  hashing: Add hasher for double

[avi: add missing include <memory> to hashing.hh]
2024-11-24 18:44:32 +02:00
Amnon Heiman
1f688bc670 cql3/query_processor.cc: Add skip_when_empty to metrics
This patch introduces the skip_when_empty flag to all CQL counters that
previously lacked this setting.

The skip_when_empty flag is a metric optimization that prevents
reporting on counters that have never been used. Once a counter has been
used (i.e., it holds a positive value), it will continue to be reported
consistently from that point onward.

Fixes #21046

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#21565
2024-11-24 17:30:46 +02:00
Gleb Natapov
a1de06d90f migration_manager: move migration manager verbs to the IDL 2024-11-24 11:02:03 +02:00
Gleb Natapov
020e8010e8 storage_proxy: remove unused function 2024-11-24 11:01:39 +02:00
Gleb Natapov
3d6fe7beb3 storage_proxy: co-routinize handle_paxos_prepare 2024-11-24 11:01:31 +02:00
Gleb Natapov
e337e5a3f6 storage_proxy: co-routinise handle_paxos_prune 2024-11-24 11:01:15 +02:00
Gleb Natapov
3cf5c187cb service: raft: no need to sync schema if the cluster is in raft topology mode
Schema syncing during group0 joining is needed during upgrade from a
cluster without raft to one that will be managed by raft, but if
topology cmd is enabled by cluster during group0 join it means that the
cluster is already in the raft mode and the schema sync can be safely
skipped.
2024-11-24 10:58:06 +02:00
Gleb Natapov
793b426137 messaging_service: co-routinize messaging_service::stop_client 2024-11-24 10:57:32 +02:00
Gleb Natapov
20e51e8eb0 gossiper: rename apply_state_locally_without_listener_notification to apply_state_locally_in_shadow_round
The function runs only from shadow round and the difference between
handling regular case and shadow round are more than just notifications.
2024-11-24 10:34:26 +02:00
Kefu Chai
e2e6f4f441 repair: s/Exceute/Execute/ in logging message
fix a typo in the logging message.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21661
2024-11-22 15:21:56 +02:00
muthu90tech
0ea0234a7a Avoid unnecessary copy in query_processor::execute_direct_without_checking_exception_message
instead of making a copy of the warnings vector, make the warnings a non const in prepared_statement
and move the warnings vector to execute_maybe_with_guard

Closes scylladb/scylladb#20361

Closes scylladb/scylladb#21083
2024-11-22 13:34:31 +02:00
Botond Dénes
75ccb9f266 docs: sstableloader.rst: add deprecation notice
The java tools (including sstableloader) are deprecated and slated for
removal in the next ScyllaDB release. Add a notice about this to the
sstableloader page.
2024-11-22 03:39:48 -05:00
Botond Dénes
f22f022e16 docs: admin-tools: update deprecation notice for sstable{dump,metadata}
These two tools already have a deprecation notice, since ScyllaDB 5.4.
Now we have a target release for the actual removal of these tools, so
update the deprecation notice to reflect that.
2024-11-22 03:39:48 -05:00
Botond Dénes
5fe5a15d1c docs: tools_index.rst: remove deprecated sstablereset and sstablerepairedset tools
Theset tools were unused and one of them doesn't even work, as ScyllaDB
doesn't have incremental repair implemented.
We are deprecating the java tools in the next release so drop these from
the list. Since they don't even have a page of their own, they don't get
a deprecation notice like the other tools in this PR.
2024-11-22 03:39:48 -05:00
Alexander Turetskiy
e83ab28d2d Improve compation on read of expired tombstones
compact expired tombstones in cache even if they are blocked by
commitlog

fixes #16781

Closes scylladb/scylladb#21613
2024-11-22 10:31:21 +02:00
Kamil Braun
8d52f30b74 Merge 'more gossiper code cleanups' from Gleb
More gossiper cleanups that accumulated since the previous one.

* 'gleb/more-gossip-cleanup-v2' of github.com:scylladb/scylla-dev:
  gossiper: replace milliseconds with seconds where appropriate
  gossiper: simplify failure_detector_loop loop a bit
  gossiper: use fmt library to format time
  gossiper: drop on_success callback from mutate_live_and_unreachable_endpoints
  gossiper: remove code duplication between shadow round and regular path when state is applied
  gossiper: remove remnants of old shadow round
  gossiper: fix indentation after the last patch
  gossiper: co-routinize do_shadow_round
2024-11-21 11:10:23 +01:00
Kefu Chai
f69ebc1797 configure.py: remove --python command line option
Remove the `--python` option which was originally added in 780d9a26b2 to support
CentOS's non-standard python3 path (`/usr/bin/python3.4`).

Since we now:
- Build using a Fedora-based container with standard python3 path
- Use properly configured shebangs in build scripts
- Set correct executable permissions on Python scripts

This change:
1. Removes the `--python` command line option
2. Updates build rules to execute Python scripts directly instead of via interpreter

This simplifies the build system and reduces differences between CMake and
configure.py-generated rules.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21607
2024-11-21 06:30:56 +02:00
Botond Dénes
b75c2eb71c readers/mutation_reader: migrate to std::ranges 2024-11-20 11:45:55 -05:00
Botond Dénes
351846bb61 readers/multishard: migrate to std::ranges::{push,pop}_heap()
std::ranges::{push,pop}_heap() will only generate default comparator if
the compared types are fully ordered. So we need to pass std::less<>
explicitely as comparator for the code to compile.
2024-11-20 11:44:48 -05:00
Yaron Kaikov
d9cde7cca5 .github/scripts/auto-backport.py: add user as collaborator to scylladbbot fork
As reported by @Deexie, during the process of opening backport PRs in https://github.com/scylladb/scylladb/pull/21616, No invite emails were sent, causing a lack of permissions for the backport PR branch

The check if `has_in_collaborators(pr.user.login)` was pointing to `scylladb/scylladb` instead of `scylladbbot/scylladb`, fixing it

I also moved the collaborator check to an early stage, before trying to open a backport PR

Closes scylladb/scylladb#21645
2024-11-20 14:34:38 +02:00
Tomasz Grabiec
0d2583600d Merge 'Add tablet repair scheduler support' from Asias He
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.

The current repair scheduler implementation does the following:

- A tablet is picked to be repaired when is requested by user

- The tablet repair can be scheduled along with tablet migration and
  rebuild. It runs in the tablet_migration track.

- Repair jobs are scheduled in a smart way so that at any point in time,
  there are no more than configured jobs per shard, which is similar to
  scylla manager's control.

New feature. No backport is needed.

Closes scylladb/scylladb#21088

* github.com:scylladb/scylladb:
  test: Add tests for tablet repair scheduler
  repair: Add restful API for tablet repair
  repair: Add tablet repair scheduler internal API support
  docs: Update system_keyspace.md for tablet repair related info
  docs: Add docs for tablet repair migration
  repair: Add core tablet repair scheduler support
  messaging_service: Introduce TABLET_REPAIR verb
  tablet_allocator: Introduce stream_weight for tablet_migration_streaming_info
  network_topology_strategy: Preserve fields of task_info in reallocate_tablets
2024-11-20 13:28:17 +01:00
Botond Dénes
1096ebd2b2 readers/combined: migrate to std::ranges::subrange<>
From boost::iterator_range<>. One return in maybe_produce_batch() had to
be adjusted because it used a strange initialization of
boost::iterator_range<>, which should not even had compiled.
2024-11-20 04:31:39 -05:00
Botond Dénes
5a66d95e02 readers/combined: migrate to std::ranges::{push,pop}_heap() 2024-11-20 04:31:39 -05:00
Amnon Heiman
1e4fb2442a alternator/test_metrics: Add tests for table consumption units
Adding tests to verify the RCU and WCU metrics.

A new helper function check_increases_metric_exact check that a given
metrics increased by a given number.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-20 11:28:53 +02:00
Amnon Heiman
95c45ca269 test_returnconsumedcapacity.py: Add putItem tests
This patch adds testing for putItem consume capacity.

There is an additional test for number support. Numbers are encoded
differently with alternator and dynamoDB, the test adds some flexibility
in the result so it would pass both DynamoDB and Alternator.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-20 11:27:43 +02:00
Gleb Natapov
812a90bfe3 gossiper: replace milliseconds with seconds where appropriate 2024-11-20 10:52:19 +02:00
Gleb Natapov
18caa6b22f gossiper: simplify failure_detector_loop loop a bit 2024-11-20 10:52:19 +02:00
Gleb Natapov
9489ad0d2f gossiper: use fmt library to format time 2024-11-20 10:52:19 +02:00
Gleb Natapov
39e44db01f gossiper: drop on_success callback from mutate_live_and_unreachable_endpoints
There is only one user of it and it can just execute its code after
calling mutate_live_and_unreachable_endpoints.
2024-11-20 10:52:18 +02:00
Gleb Natapov
0116704226 gossiper: remove code duplication between shadow round and regular path when state is applied
Differences is about notification so move the notification check into
functions that handle state change.
2024-11-20 10:52:10 +02:00
Botond Dénes
d94591c260 Merge 'treewide: replace boost::find_if with std::ranges::find_if' from Kefu Chai
now that we are allowed to use C++23. we now have the luxury of using `std::ranges::find_if`.

in this change, we:

- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`

to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21495

* github.com:scylladb/scylladb:
  treewide: replace boost::find_if with std::ranges::find_if
  counters: replace boost::find_if with std::ranges::find_if
  combine.hh: use std::iter_const_reference_t when appropriate
2024-11-20 09:58:13 +02:00
Botond Dénes
075ca6cc02 Merge 'cql3: respect PER PARTITION LIMIT for aggregate queries' from Paweł Zakrzewski
Currently, PER PARTITION LIMIT is not implemented for aggregates and queries can result in more rows than expected from the same partition.

Instrument the result_set_builder class so that it can enforce PER PARTITION LIMIT for aggregate queries, specifically:
- add per_partition_limit to the result_set_builder
- expose the number of input rows in the selector

result_set_builder gets two new functions handling partition start and end:
- accept_partition_end for notifying that a partition has been finished. This is also called when a page ends, so we cannot simply flush here, as a naive implementation could do.
- accept_new_partition, where we flush_selectors() if it's indeed a new partition (and not a continuation of the previous) and the query has a grouping: we don't want to flush on new partition in a query like SELECT COUNT(*) FROM foo;

Fixes #5363

Closes scylladb/scylladb#21125

* github.com:scylladb/scylladb:
  test: enable PER PARTIION LIMIT + GROUP BY tests
  cql3: respect PER PARTITION LIMIT for aggregates
  cql3: selection: count input rows in the selector
  cql3: selection: pass per partition limit to the result_set_builder
  cql3: show different messages for LIMIT and PER PARTITION LIMIT in get_limit
2024-11-20 09:54:28 +02:00
Botond Dénes
5ccbd500e0 Merge 'repair: fix task_manager_module::abort_all_repairs' from Aleksandra Martyniuk
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.

A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.

Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.

Fixes: #21612.

Needs backport to 6.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#21616

* github.com:scylladb/scylladb:
  test: add test to check if repair is properly aborted
  repair: add shard param to task_manager_module::is_aborted
  repair: use task abort source to abort repair
  repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
  repair: fix task_manager_module::abort_all_repairs
2024-11-20 06:43:01 +02:00
Asias He
ddfec068d0 test: Add tests for tablet repair scheduler 2024-11-20 09:42:41 +08:00
Asias He
844129227e repair: Add restful API for tablet repair
It allows user to add and del a tablet repair request. The request is
executed by the tablet repair scheduler.
2024-11-20 09:42:41 +08:00
Asias He
ca1fc28605 repair: Add tablet repair scheduler internal API support
Those internal APIs allow to add / del a tablet repair request and
config the tablet repair scheduler.

It can be used by task manager or plain restful api.
2024-11-20 09:42:41 +08:00
Asias He
9d58a911f1 docs: Update system_keyspace.md for tablet repair related info 2024-11-20 09:42:41 +08:00
Asias He
afd356ea9a docs: Add docs for tablet repair migration 2024-11-20 09:42:41 +08:00
Asias He
b71a563030 repair: Add core tablet repair scheduler support
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.

The current repair scheduler implementation does the following:

- A tablet is picked to be repaired when the time since last repair is
  bigger than a threshold (auto repair mode) or it is requested by user
  (manual repair mode)

- The tablet repair can be scheduled along with tablet migration and
  rebuild. It runs in the tablet_migration track.

- Repair jobs are scheduled in a smart way so that at any point in time,
  there are no more than configured jobs per shard, which is similar to
  scylla manager's control.

In this patch, both the manual repair and the auto repair are not
enabled yet.
2024-11-20 09:42:41 +08:00
Amnon Heiman
56dce5fe8a Alternator: add WCU support
This patch adds functionality to track Write Capacity Units (WCU).
Currently for the put_item operation.

This enhancement allows for standardized measurement of write
operations, aligning with DynamoDB-like metrics.

Additionally, the WCU value is now optionally included in the response to provide
immediate feedback on the write capacity usage.

The implementation adds a consumed_capacity_counter member to
rmw_operation, this will allow to add WCU functionality to update_item
and delete_item
2024-11-19 18:43:28 +02:00
Amnon Heiman
3c46d78e6a Add test/alternator/test_returnconsumedcapacity.py
This patch adds testing for the consumedCapacity header.
It's currently only test get_item

The test works with both AWS and alternator.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-19 18:43:28 +02:00
Amnon Heiman
b8f7b2eb52 alternator/executor: Add consume capacity for get_item
This patch adds functionality to track Read Capacity Units (RCU) for the
get_item operation. This enhancement allows for standardized measurement
of read operations, aligning with DynamoDB-like metrics.

Additionally, the RCU value can now be included in the response to
provide immediate feedback on the read capacity usage.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-19 18:43:28 +02:00
Amnon Heiman
2b10296a82 alsternator/stats: Add rcu and wcu metrics to stats
Introduced `rcu` (Read Capacity Units) and `wcu` (Write Capacity Units)
metrics to the `stats` object for enhanced capacity tracking.

`rcu` and `wcu` provide a simplified way of measuring reads and writes,
respectively, by representing capacity usage in standardized units.

This patch adds these metrics to the existing alternator stats, enabling
monitoring of the total consumed units.
2024-11-19 18:43:28 +02:00
Amnon Heiman
b0e699e7ec alternator/executor.hh: white-space cleanup
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-19 18:43:28 +02:00
Amnon Heiman
eedf390196 Add the consume_capacity helper class
Alternator API should support returning WCU and RCU when requested.
The consumed capacity helper class serves multiple purposes:
1. Break the logic of calculating the RCU and WCU from the main code.
2. Add a helper class consumed_capacity_counter that can accumulate bytes.
3. Optionally update counters for RCU and WCU that will be used by the
   metric layer.
4. Update the response with the consumed units if needed.

The consumed_capacity_counter is a base class with two implementations:
A read and write implmenentation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-11-19 18:42:56 +02:00
Nadav Har'El
733a4f94c7 Merge 'test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime' from Dawid Mędrek
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.

The scenario corresponding to the summary above looked like this:

1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
   already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
   table has a row, so it generates a view update for it because
   it doesn't notice that we already have data in the view.

We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.

Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.

Fixes https://github.com/scylladb/scylladb/issues/20889

Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.

Closes scylladb/scylladb#21632

* github.com:scylladb/scylladb:
  test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime
2024-11-19 18:10:52 +02:00
Gleb Natapov
8204bbf547 gossiper: remove remnants of old shadow round
Starting from 108aae09c5 new way of doing
shadow round is mandatory.
2024-11-19 14:43:51 +02:00
Gleb Natapov
4b3d160f34 gossiper: fix indentation after the last patch 2024-11-19 14:43:50 +02:00
Gleb Natapov
edeb8b0e46 gossiper: co-routinize do_shadow_round 2024-11-19 14:43:50 +02:00
Dawid Mędrek
af4afc84ec test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
The auxiliary function `eventually()` (defined in `test/lib/eventually.hh`)
tries to execute a passed function. If it throws, `eventually()` sleeps
for `2^#previous_attempts` milliseconds and tries to perform it again.
The default limit of attempts is 17.

In `test_view_update_generating_writetime`, right before the last test
case, we perform:

```cql
UPDATE t USING TTL 10 AND TIMESTAMP 8 SET g=40 WHERE k=1 AND c=1;
```

The test case itself executes:

```cql
SELECT WRITETIME(g) FROM t;
```

and asserts that the result of the query is equal to 8, i.e. it
corresponds to the timestamp of the last write to the table `t`.

However, if the test case keeps failing, then during its 14th attempt
(so affter sleeping for at least `2^14 - 1` milliseconds, which amounts
to about 16 seconds), we'll observe the following error:

```
[Exception] - std::runtime_error: Expected row not found: [0000000000000008] not in {result_message::rows {row:  null}}
```

The reason behind it is the specified TTL is too short. 10 seconds will
have already passed before the 14th attempt, so the value in the column
`g` will be `NULL` again. In particular, the `WRITETIME(g)` will no
longer be equal to `8`.

To solve that issue, we change the TTL in the CQL statement to 300.
The time spent on 17 loops of `eventually()` amounts to about
`2^18 - 1` milliseconds, which is about 263 seconds. That's why
setting the TTL to 300 seconds should be enough to prevent the error
from occurring.
2024-11-19 13:02:34 +01:00
Dawid Mędrek
5ca0cc4e85 test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.

The scenario corresponding to the summary above looked like this:

1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
   already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
   table has a row, so it generates a view update for it because
   it doesn't notice that we already have data in the view.

We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.

Fixes scylladb/scylladb#20889
2024-11-19 12:51:22 +01:00
Aleksandra Martyniuk
f5795e8aa4 test: add test to check if repair is properly aborted 2024-11-19 11:59:29 +01:00
Paweł Zakrzewski
b893e63b4a test: enable PER PARTIION LIMIT + GROUP BY tests 2024-11-19 09:28:01 +01:00
Nadav Har'El
7607f5e33e alternator: fix "/localnodes" to not return down nodes
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.

Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.

However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().

This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.

Fixes #21538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21540
2024-11-19 10:04:59 +02:00
Yaron Kaikov
980f6a48ab .github/scripts/auto-backport.py: validate backport candidate with Fixes prefix
Adding `Fixes` validation to a PR when backport labels were added. When the auto backport process triggers (after promotion), we will ensure each PR with backport/x.y label also has in the PR body a `Fixes` reference to an issue

Fixes: https://github.com/scylladb/scylladb/issues/20021

Closes scylladb/scylladb#21563
2024-11-19 09:48:34 +02:00
Benny Halevy
165902b951 conf/scylla.yaml: update documentation for enable_tablets
Change e3e8a94c9a changed
the semantics of the enable_tablets config option,
but updating that in the option documentation in scylla.yaml
was missed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21614
2024-11-19 09:44:53 +02:00
Botond Dénes
36870feb29 Merge 'test: route S3 Proxy server messages through logger' from Kefu Chai
This change was created in the same spirit of f8221b960f.
The S3ProxyServer (introduced in 8919e0abab) currently prints its
status directly to stdout, which can be distracting when reviewing test
results. For example:

```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
Setting minio proxy random seed to 1731924995
Starting S3 proxy server on ('127.193.179.2', 9002)
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store release [ PASS ] object_store.test_backup.1
Stopping S3 proxy server
------------------------------------------------------------------------------
CPU utilization: 3.1%
```

Move these messages to use proper logging to give developers more control
over their visibility:

- Make logger parameter mandatory in S3ProxyServer constructor
- Route "Stopping S3 proxy" message through the provided logger
- Add --log-level option to the standalone proxy server launcher

The message is now hidden:

```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store release [ PASS ] object_store.test_backup.1
------------------------------------------------------------------------------
CPU utilization: 4.1%
```

---

this change improves the developer experience, hence no need to backport.

Closes scylladb/scylladb#21610

* github.com:scylladb/scylladb:
  test: route S3 Proxy server messages through logger
  test: s3_proxy: remove unused method
2024-11-19 06:42:28 +02:00
Kefu Chai
33a0e5b892 treewide: replace boost::find_if with std::ranges::find_if
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::find_if`.

in this change, we:

- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-19 10:50:01 +08:00
Kefu Chai
3e75fbd9d3 counters: replace boost::find_if with std::ranges::find_if
std::ranges allows us to create a range from a pair of iterators.
but the iterator has to fulfill the concept of `std::semiregular`.
in order to reduce the header dependency on boost, we need to
make `basic_counter_cell_view::shard_iterator` to support
`std::semiregular`.

in this change:

- define a default constructor for
  `basic_counter_cell_view::shard_iterator`, so that the iterator
  satisfies the constraints of `std::semiregular`, as required by
  C++20's forward_iterator concept. please note, despite that
  the standard requires the iterator to be `std::semiregular`, but
  the iterator created by default constructor is not evaluated in
  production. sometimes, the standard algorithms just need to
  store/create itermediate iterators or to represent a "singular"
  state for iterator. a use case is an empty container.
- change `basic_counter_cell_view::shard_iterator::reference` so
  its dereference returns a rvalue instead of a reference. because
  per C++20 standard, the dereference of a forward_iterator should
  be stable, but we were returning a reference / pointer referencing
  a member variable of the iterator. so once the iterator is destructed,
  the returned reference / pointer would be invalidated. so we have to
  return a value to fulfill the requiremend of forward_iterator. this
  change also fulfills the requirement of `same_as<iter_reference_t<It>,
  iter_reference_t<const It>>`, which a part of the
  `indirectly_readable` requirement.
- let `basic_counter_cell_view::shards()` return a subrange
- let `basic_counter_shard_view::swap_value_and_clock()` accepts
  a plain value instead of a reference. because the dereference of
  the iterator does not return a reference anymore. and the returned
  type is a lightweighted "view", so the performance penality is
  negligible.
- use ranges libraries when appropriate in this header.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-19 10:50:01 +08:00
Kefu Chai
69939ee653 combine.hh: use std::iter_const_reference_t when appropriate
before this change, we assumed that the dereference types of
the given `InputIterator1` and `InputIterator2` are always
references. but this does not hold if the `operator*` returns
a rvalue, as in the C++20 standard, unlike the LegacyForwardIterator
requirement, `std::forward_iterator` does not requires
dereference to return a reference. so we should not assume this,
if we want to use `combine()` with iterators whose dereference
return a, for instance, rvalue.

in this change, we use `std::iter_const_reference_t` instead. this
type is deduced from the behavior of the iterator instead of hardwire
it to a reference type. this allows us to use a C++20 forward_iterator
with this generic function.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-19 10:50:01 +08:00
Asias He
5b17be6494 messaging_service: Introduce TABLET_REPAIR verb
It is used by the tablet repair scheduler.
2024-11-19 10:04:41 +08:00
Asias He
82a10eca55 tablet_allocator: Introduce stream_weight for tablet_migration_streaming_info
The stream_weight for repair migration is set to 2, because it requires
more work than just moving the tablet around. The stream_weight for all
other migrations are set to 1.
2024-11-19 10:04:41 +08:00
Asias He
c975882e03 network_topology_strategy: Preserve fields of task_info in reallocate_tablets
So other fields will not be dropped when the new tablet is created.
2024-11-19 10:04:41 +08:00
Avi Kivity
b14871ad3f Merge 'code cleanup: remove "sstring_view" and replace its usages by std::string_view' from Nadav Har'El
For historic reasons, we have (in bytes.hh) a type sstring_view which is an alias for std::string_view - since the same standard type can hold a pointer into both a seastar::sstring and std::string.

This alias in unnecessary and misleading to new developers, who might be misled to believe it is assume it is somehow different from std::string_view - when it isn't.

This series removes all uses of sstring_view (changing them to use std::string_view), and in the last patch removes the alias itself. A few functions whose name referred to "sstring" but take a std::string_view were renamed.

The patches are fairly mechanical and trivial, with no functional changes intended. To ease the review the series was split to a few smaller patches that modify specific areas of the code.

Fixes #4062.

Closes scylladb/scylladb#21617

* github.com:scylladb/scylladb:
  bytes: remove unused alias sstring_view
  change remaining sstring_view to std::string_view
  test: change sstring_view to std::string_view
  cql3: change sstring_view to std::string_view
  alternator: change sstring_view to std::string_view
  type: change from_sstring() to from_string_view()
  cross-tree: change to_sstring_view() to to_string_view()
2024-11-18 22:43:46 +02:00
Tomasz Grabiec
06d478793d Merge 'mutation: switch from boost ranges to std ranges' from Avi Kivity
Wean the mutation code (at least the headers) from boost ranges to std ranges,
in order to reduce the dependency load.

Cleanup, so no backport.

Closes scylladb/scylladb#21601

* github.com:scylladb/scylladb:
  partition_snapshot_row_cursor.hh: switch from boost ranges to std ranges
  mutation: mutation_partition_v2.hh: switch from boost ranges to std ranges
  mutation: mutation_partition.hh: switch from boost ranges to std ranges
  partition_snapshot_reader.hh: drop unused include boost/range/algorithm/heap_algorithm.hpp
2024-11-18 21:23:29 +01:00
Luis Freitas
34d7a4401d ./github/workflows/conflict_reminder.yaml: fix assignee object
References the login property of object assignee

Closes scylladb/scylladb#21615
2024-11-18 19:42:58 +02:00
Paweł Zakrzewski
08eb853a96 cql3: respect PER PARTITION LIMIT for aggregates
This change adds support for PER PARTITION LIMIT for aggregate queries.
result_set_builder gets two new functions handling partition start and
end:
- accept_partition_end for notifying that a partition has been finished.
  This is also called when a page ends, so we cannot simply flush here,
  as a naive implementation could do.
- accept_new_partition, where we flush_selectors() if it's indeed a new
  partition (and not a continuation of the previous) and the query has a
  grouping: we don't want to flush on new partition in a query like
  SELECT COUNT(*) FROM foo;
2024-11-18 17:56:53 +01:00
Paweł Zakrzewski
8190d76dd6 cql3: selection: count input rows in the selector
This will allow result_set_builder::flush_selectors() to only flush when
there are input rows.
2024-11-18 17:56:53 +01:00
Paweł Zakrzewski
aea3c3851e cql3: selection: pass per partition limit to the result_set_builder
Aggregates require the limit to be applied from within the builder
class, so it needs to be passed to it.
2024-11-18 17:56:53 +01:00
Paweł Zakrzewski
cb1483037c cql3: show different messages for LIMIT and PER PARTITION LIMIT in get_limit
select_statement::get_limit is used to evaluate the LIMIT value for both
LIMIT and PER PARTITION LIMIT. This change fixes the error message for
incorrect values passed by the user.
2024-11-18 17:56:53 +01:00
Aleksandra Martyniuk
ca14167b20 repair: add shard param to task_manager_module::is_aborted
Currently, task_manager_module::is_aborted checks whether a task
with given id was aborted on this shard.

In tablet_repair_task_impl::run, is_aborted method is called on all
shards to check if the parent task was aborted. However, even
for aborted parent, is_aborted will return true only on owner shard
of the parent.

Pass shard param to task_manager_module::is_aborted that indicates
which shard to check.
2024-11-18 16:26:08 +01:00
Aleksandra Martyniuk
de8d59172a repair: use task abort source to abort repair
Aborting of a top-level repair does not need task_mananger_module
anymore. Use task's abort source wherever possible.
2024-11-18 16:26:08 +01:00
Aleksandra Martyniuk
a6d9931705 repair: drop _aborted_pending_repairs and utilize tasks abort mechanism 2024-11-18 16:26:08 +01:00
Aleksandra Martyniuk
db8308fe93 repair: fix task_manager_module::abort_all_repairs
Currently, task_manager_module::abort_all_repairs marks top-level
repairs as aborted (but does not abort them) and aborts all shard
tasks. If after that a top-level repair creates a shard task,
the new shard repair won't be aborted.

Abort top-level repair tasks in abort_all_repairs. They will abort
their children and newly created shard tasks will be immediately
aborted.
2024-11-18 16:25:57 +01:00
Nadav Har'El
5e20cb8c66 bytes: remove unused alias sstring_view
Our "sstring_view" was an historic alias for the standard std::string_view.
All its uses were removed in the previous patches, so we can now finally
remove this unused alias.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 16:51:15 +02:00
Nadav Har'El
e639434a89 change remaining sstring_view to std::string_view
Our "sstring_view" is an historic alias for the standard std::string_view.
The patch changes the last remaining random uses of this old alias across
our source directory to the standard type name.

After this patch, there are no more uses of the "sstring_view" alias.
It will be removed in the following patch.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 16:48:57 +02:00
Nadav Har'El
e72aabae7f test: change sstring_view to std::string_view
Our "sstring_view" is an historic alias for the standard std::string_view.
The test/ directory used this old alias in a few of random places, let's
change them to use the standard type name.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 16:26:20 +02:00
Nadav Har'El
b778ce08a9 cql3: change sstring_view to std::string_view
Our "sstring_view" is an historic alias for the standard std::string_view.
The cql3/ directory used this old alias in a few of random places, let's
change them to use the standard type name.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 15:57:20 +02:00
Nadav Har'El
f2b4a59ec7 alternator: change sstring_view to std::string_view
Our "sstring_view" is an historic alias for the standard std::string_view.
Alternator only used this alias in a couple of random names, let's change
them to the standard type name.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 15:44:49 +02:00
Nadav Har'El
766ee56536 type: change from_sstring() to from_string_view()
All CQL type implementations have a from_sstring(sstring_view) method.
The "sstring_view" type is just an historic alias for std::string_view,
so this patch switches to use the standard type as suggested in #4062,
and also renames these functions from_string_view() to emphesize they can
take any string view, and not necessarily a "sstring" as their old name
suggested.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 15:33:04 +02:00
Nadav Har'El
da99dc3a7f cross-tree: change to_sstring_view() to to_string_view()
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and std::string.

This alias in unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurances of sstring_view (the request in #4062), but
begins to do it by renaming one commonly-used function, to_sstring_view(bytes)
to to_string_view() and of course changes all its uses to the new name.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 14:57:49 +02:00
Kefu Chai
cb24022b54 test: route S3 Proxy server messages through logger
This change was created in the same spirit of f8221b960f.
The S3ProxyServer (introduced in 8919e0abab) currently prints its
status directly to stdout, which can be distracting when reviewing test
results. For example:
```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
Setting minio proxy random seed to 1731924995
Starting S3 proxy server on ('127.193.179.2', 9002)
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store release [ PASS ] object_store.test_backup.1
Stopping S3 proxy server
------------------------------------------------------------------------------
CPU utilization: 3.1%
```

Move these messages to use proper logging to give developers more control
over their visibility:

- Make logger parameter mandatory in S3ProxyServer constructor
- Route "Stopping S3 proxy" message through the provided logger
- Add --log-level option to the standalone proxy server launcher

The message is now hidden:

```console
$ ./test.py --mode release object_store/test_backup::test_simple_backup_and_restore
Found 1 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store release [ PASS ] object_store.test_backup.1
------------------------------------------------------------------------------
CPU utilization: 4.1%
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-18 18:41:17 +08:00
Kefu Chai
0dff187b7a test: s3_proxy: remove unused method
neither `InjectingHandler.log_error`, nor `InjectingHandler.log_message`
is used. so let's drop them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-18 18:39:15 +08:00
Aleksandra Martyniuk
572b005774 repair: implement tablet_repair_task_impl::release_resources
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.

Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.

Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.

Fixes: #21503.

Closes scylladb/scylladb#21504
2024-11-18 12:29:58 +02:00
Avi Kivity
bef015da0d Revert ".github/scripts/auto-backport.py: validate backport candidate with Fixes prefix"
This reverts commit 8414cd743e. It prevents
pulling pull requests that do have "Fixes" references.
2024-11-18 12:29:22 +02:00
Avi Kivity
3a6c0a9b36 Merge 'compaction: Perform integrity checks on compacting SSTables' from Nikos Dragazis
This PR enables compaction tasks to verify the integrity of the input data through checksum and digest checks. The mechanism for integrity checking was introduced in previous PRs (#20207, #20720) as a built-in functionality of the input streams. This PR integrates this mechanism with compaction. The change applies to all compaction types and covers both compressed and uncompressed SSTables adhering to the 3.x format. If a compaction task reads only part of an SSTable, then only the per-chunk checksums are verified, not the digest.

The PR consists of:
* Changes to mx readers to support integrity checking. The kl readers, considered as compatibility-only, were left unchanged. Also, integrity checking on single-partition reversed reads (`data_consume_reversed_partition()`) remains unsupported by mx readers as this is not used in compaction.
* Changes to `sstable` and `sstable_set` APIs to allow toggling integrity checks for mx readers.
* Activation of integrity checking for all compaction types.
* Tests for all compaction types with corrupted SSTables.

Integrity checks come at a cost. For uncompressed SSTables, the cost is the loading of the CRC and Digest components from disk, and the calculation of checksums and digest from the actual data. For compressed SSTables, checksums are stored in-place and they are being checked already on all reads, so the only extra cost is the loading and calculation of the digest. The measurements show a ~5% regression in compaction performance for uncompressed SSTables, and a negligible regression for compressed SSTables.

Command: `perf-sstable --smp=1 --cpuset=1 --poll-mode --mode=compaction --iterations=1000 --partitions 10000 --sstables=1 --key_size=4096 --num_columns=15 --column_size={32, 1024, 3500, 7000, 14500}`

Uncompressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec)  | Integrity (p/sec)    | Regression |
+--------------+-----------------------+----------------------+------------+
| 50  MiB      | 65175.59 +- 80.82     | 61814.63 +- 72.88    | 5.16%      |
| 200 MiB      | 41795.10 +- 60.39     | 39686.28 +- 45.05    | 5.05%      |
| 500 MiB      | 21087.41 +- 30.72     | 20092.93 +- 25.05    | 4.72%      |
| 1   GiB      | 12781.64 +- 21.77     | 12233.94 +- 21.71    | 4.29%      |
| 2   GiB      |  6629.99 +-  9.40     |  6377.13 +-  8.28    | 3.81%      |
+--------------+-----------------------+----------------------+------------+
```
Compressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec)  | Integrity (p/sec)    | Regression |
+--------------+-----------------------+----------------------+------------+
| 50  MiB      | 53975.05 +- 63.18     | 53825.93 +- 62.28    |  0.28%     |
| 200 MiB      | 28687.94 +- 26.58     | 28689.41 +- 26.91    |  0%        |
| 500 MiB      | 13865.35 +- 15.50     | 13790.41 +- 14.88    |  0.54%     |
| 1   GiB      |  7858.10 +-  7.71     |  7829.75 +-  9.66    |  0.36%     |
| 2   GiB      |  4023.11 +-  2.43     |  4010.54 +-  2.55    |  0.31%     |
+--------------+-----------------------+----------------------+------------+
(p/sec = partitions/sec)
```

Refs #19071.

New feature, no backport is needed.

Closes scylladb/scylladb#21153

* github.com:scylladb/scylladb:
  test: Add test for compaction with corrupted SSTables
  compaction: Enable integrity checks for all compaction types
  sstables: Add integrity option to factories for sstable_set readers
  sstables: Add integrity option to sstable::make_reader()
  sstables: Add integrity option to mx::make_reader()
  sstables: Load checksums and digests in mx full-scan reader
  sstables: Add integrity option to data_consume_single_partition()
  sstables: Disengage integrity_check from sstable class
  sstables: Allow data sources to disable digest check
2024-11-17 20:59:31 +02:00
Nadav Har'El
f23800181a Merge 'Align Metric Family Descriptions' from Amnon Heiman
Metrics families (e.g., all metrics with the same name but with different labels) should have the same description.
The metric layer does not enforce that. Instead, it will use the first description provided.
It's a minor issue but the results are different than what you expect.

No need to backport.

Closes scylladb/scylladb#19947

* github.com:scylladb/scylladb:
  service/storage_proxy.cc  All metric groups should have the same description
  raft/server.cc: All metric groups should have the same description
2024-11-17 16:49:57 +02:00
Tomasz Grabiec
8738d9bfa0 system_tables: Compute schema version automatically
This depends on the previous change to the schema_builder
which makes version computation depend on definition only
instead of being new time uuid.

This way we avoid the possibility for a common mistake
when schema of a system table is extended but we forget
to bump up its version passed to .with_version().
2024-11-15 19:16:41 +01:00
Tomasz Grabiec
05a1e0dc40 schema_builder: Introduce with_hash_version()
Currently, if version is missing, we use a unique timeuuid as the
version. It's not useful for creating static schema of system tables
because to achieve the same version on all the nodes, version needs to
be provided externally.

This patch introduces a way to build the schema with version computed
from schema definition, so we can have a stable version which is the
same on all machines.

Will be used for reliable computation of schema version for system
tables. System tables currently set the version statically and we rely
on the developer to bump up the version manually when the definition
changes. We cannot use mutation hash, since system tables are
initialized too rearly (mutation hash needs system schema to be
already there). This is a very error prone process, as it is easy to
forget to do so, and the issue comes up only when testing mixed
clusters.
2024-11-15 19:16:40 +01:00
Tomasz Grabiec
0334c2c24c schema: Store raw_view_info in schema::raw_schema
It will be used for hashing, which will work with raw_schema.

Also, it's more in-line with the current design, where basic information
is kept in raw_schema and other fields are derived from it.
2024-11-15 19:16:40 +01:00
Tomasz Grabiec
70441dc2b3 schema: Remove dead comment 2024-11-15 19:16:40 +01:00
Tomasz Grabiec
5dbbbf6300 hashing: Add hasher for unordered_map 2024-11-15 19:16:40 +01:00
Tomasz Grabiec
8209b301a3 hashing: Add hasher for unique_ptr 2024-11-15 19:16:40 +01:00
Tomasz Grabiec
a2c3b9a038 hashing: Add hasher for double 2024-11-15 19:16:40 +01:00
Avi Kivity
9720bb1e5f partition_snapshot_row_cursor.hh: switch from boost ranges to std ranges
Converge on one range solution.
2024-11-15 14:39:39 +02:00
Avi Kivity
1c26c8deeb mutation: mutation_partition_v2.hh: switch from boost ranges to std ranges
Consolidate on one range solution. Fallout in mutation_partition_v2.cc
and row_cache_test.cc due to interoperability problems is adjusted.
2024-11-15 14:36:28 +02:00
Avi Kivity
de822d3a46 mutation: mutation_partition.hh: switch from boost ranges to std ranges
Consolidate on one range solution. Fallout in mutation_partition.cc
due to interoperability problems is adjusted.
2024-11-15 14:09:31 +02:00
Avi Kivity
6d110b530c partition_snapshot_reader.hh: drop unused include boost/range/algorithm/heap_algorithm.hpp 2024-11-15 14:02:19 +02:00
Kefu Chai
5bc03da0c4 tools/scylla-nodetool: rename estimated_row_count to estimated_partition_count
Rename the helper function from `estimated_row_count()` to `estimated_partition_count()`
to better reflect its actual behavior. While the underlying API endpoint is
"/column_family/metrics/estimated_row_count", it actually returns the estimated
partition count of the given table.

This follows up on 26ac2c23ef which updated server-side variable names but
did not change the API endpoint name. A separate change will update the tool's
documentation to address scylladb/scylladb#21586 specifically.

Refs scylladb/scylladb#21586
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21597
2024-11-15 09:43:00 +02:00
Yaron Kaikov
8414cd743e .github/scripts/auto-backport.py: validate backport candidate with Fixes prefix
Adding `Fixes` validation to a PR when backport labels were added. When the auto backport process triggers (after promotion), we will ensure each PR with backport/x.y label also has in the PR body a `Fixes` reference to an issue

Adding also this validation to `pull_github_pr.sh` per @denesb request,

Fixes: https://github.com/scylladb/scylladb/issues/20021

Closes scylladb/scylladb#21563
2024-11-15 06:51:02 +02:00
Kefu Chai
4cc9d78801 compaction: document compaction::make_interposer_consumer()
for better maintainability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#14982
2024-11-15 06:44:52 +02:00
Botond Dénes
fed2c6ba83 sstables/mx/reader: release column value buffer after consumed
data_consume_rows_context_m has a _column_value buffer it uses to read
key and column values into, preparing for parsing and consuming them.
This buffer is reset (released) in a few different cases:
* When using it for key - after consuming its content
* When using it for column value - when a colum has no value

However, the buffer is not released when used for a column value and the
column is consumed. This means that if a large column is read from the
sstable, this buffer can potentially linger and keep consuming memory
until either one of the other release scenarios is hit, or the reader is
destroyed.
Add a third release scenario, releasing the buffer after the row end was
consumed. This allows the buffer to be re-used between columns of the
same row, at the same time ensuring that a large buffer will not linger.

This patch can almost halve the memory consumption of reads in certain
circumstances. Point in case: the test
test_reader_concurrency_semaphore_memory_limit_engages starts to fail
after this fix, because the read doesn't trigger the OOM limit anymore
and needs doubling of the concurrency to keep passing.

This issue was found in a dtest
(`test_ics_refresh_with_big_sstable_files`), which writes some large
cells of up to 7MiB. After reading the row containing this large cell,
the reader holds on to the 7MiB buffer causing the semaphore's OOM
protection to kick in down the line.

Fixes: https://github.com/scylladb/scylladb/issues/21160

Closes scylladb/scylladb#21132
2024-11-14 17:24:53 +01:00
Kefu Chai
00810e6a01 treewide: include seastar/core/format.hh instead of seastar/core/print.hh
The later includes the former and in addition to `seastar::format()`,
`print.hh` also provides helpers like `seastar::fprint()` and
`seastar::print()`, which are deprecated and not used by scylladb.

Previously, we include `seastar/core/print.hh` for using
`seastar::format()`. and in seastar 5b04939e, we extracted
`seastar::format()` into `seastar/core/format.hh`. this allows us
to include a much smaller header.

In this change, we just include `seastar/core/format.hh` in place of
`seastar/core/print.hh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21574
2024-11-14 17:45:07 +02:00
Michael Pedersen
309f1606ae docs: correct the storage size for n2-highmem-32 to 9000GB
updated storage size for n2-highmem-32 to 9000GB as this is default in SC

Closes scylladb/scylladb#21537
2024-11-14 17:16:44 +03:00
Pavel Emelyanov
298602b32d Merge 'message: do not include unused headers' from Kefu Chai
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed.

also, update the workflow to prevent future regressions of including unused headers in this subdirectory.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21560

* github.com:scylladb/scylladb:
  .github: add "message" to CLEANER_DIR
  message: do not include unused headers
2024-11-14 17:15:16 +03:00
Kefu Chai
6955b8238e docs: fix monospace formatting for rm command
Add missing space before `rm` to ensure proper rendering
in monospace font within documentation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21576
2024-11-14 17:14:32 +03:00
Kefu Chai
5b8c2ad600 test/object_store: various cleanups
just for better readability:

* chain comparison statement when appropriate
* do not use f-string when there are no place holders
* use list comprehension when initializing a set
* remove unused import statement
* move import statement of the standard library before
  those which import the 3rd-party modules
* put two empty lines in-between top-level functions.
  this is recommended by PEP8.
* remove the extraneous spaces around `=` in parameter
  list.
* remove the extraneous spaces in a list like `[ 1, 2, 3 ]`
  so it looks like `[1, 2, 3]`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21561
2024-11-14 16:57:15 +03:00
Nadav Har'El
99d420daa5 test: move a materialized-view test from boost to cqlpy
This patch moves (after straightforward translation) the test
"test_views_with_future_tombstone", a regression test for #5793,
from the C++ boost framework to the Python cqlpy framework.

The main motivation this move is the ease of debugging failures:
During the work on a patch for #20679 (eliminating read-before-write)
this test began to fail, and understanding where the C++ failed was
near impossible: the Boost test framework reports that the test failed,
but not in which line or why, and adding printouts to this huge source
file require a ridiculous amount of time for recompilation every time.
In contrast, the new pytest-based version shows exactly where the
error is, beautifully:

```
>               assert [] == list(cql.execute(f'select * from {mv}'))
E               assert [] == [Row(b=2, a=1, c=3, d=4, e=5)]
test_materialized_view.py:1614: AssertionError
```

It shows exactly which assertion failed, and exactly what were the
values that were compared. Beautiful and super helpful for debugging.

Beyond the ease of debugging, moving this (and later, other) test to
the cql-pytest framework has additional advantages:

1. The test was misplaced, in the cql_test source file, and it belongs
   with materialized views tests so let's use this opportunity to move
   it to the right place.
2. Can easily run the same test on multiple versions of Scylla, and
   also on Cassandra. It's a good way to confirm the test is correct.
3. No need to recompile the test after every attempt to fix the bug.
   The cql_query_test.cc is huge - over 6,000 lines - and takes over
   a minute to compile after every attempt to fix a bug.

Refs #16134 (the issue asks to move all MV tests to cql-pytest)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21552
2024-11-14 16:55:58 +03:00
André LFA
703e6f3b1f Update report-scylla-problem.rst removing references to old Health Check Report
Closes scylladb/scylladb#21467
2024-11-14 15:12:26 +02:00
Anna Stuchlik
3bd2ecff63 doc: add the 6.0-to-2024.2 upgrade guide-from-6
This commit adds an upgrade guide from ScyllDB 6.0
to ScyllaDB Enterprise 2024.2.

Fixes https://github.com/scylladb/scylladb/issues/20063
Fixes https://github.com/scylladb/scylladb/issues/20062
Refs https://github.com/scylladb/scylla-enterprise/issues/4544

Closes scylladb/scylladb#20133
2024-11-14 15:07:43 +02:00
Kefu Chai
1cedc45c35 doc: import the new pub keys used to sign the package
before this change, when user follows the instruction, they'd get

```console
$ sudo apt-get update
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble InRelease
Hit:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease [7550 B]
Err:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease
 The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A43E06657BAC99E3
Reading package lists... Done
W: GPG error: https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease: The following signatures couldn't be verified because the public key is not av
ailable: NO_PUBKEY A43E06657BAC99E3
E: The repository 'https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
```

because the packages were signed with a different keyring.

in this change, we import the new pubkey, so that the pacakge manager
can
verify the new packages (2024.2+ and 6.2+) signed with the new key.

see also https://github.com/scylladb/scylla-ansible-roles/issues/399
and https://forum.scylladb.com/t/release-scylla-manager-3-3-1/2516
for the annonucement on using the new key.

Fixes scylladb/scylladb#21557
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21524
2024-11-14 13:33:47 +02:00
Botond Dénes
89c68d4ba7 Update seastar submodule
* seastar 1b0a3087...a5432364 (2):
  > rpc: Emplace buffers into vector, not push
  > core: reactor_config: add reserve_io_control_blocks

Refs: https://github.com/scylladb/scylladb/issues/19185

Closes scylladb/scylladb#21573
2024-11-14 12:44:10 +02:00
Tomasz Grabiec
1d0c6aa26f utils: UUID: Make get_time_UUID() respect the clock offset
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):

   forward_jump_clocks(std::chrono::seconds(60*60*24*31));

The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.

When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:

  ttl_opt topology_request_tracking_mutation_builder::ttl() const {
      return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
          - std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
  }

_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.

The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.

The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().

Closes scylladb/scylladb#21558
2024-11-14 10:32:07 +02:00
Botond Dénes
c14ace54e3 Merge 'Add testcases for tablet migration involving views' from Lakshmi Narayanan Sreethar
Added test cases to reproduce issues with tablet migration involving views.

Refs #19149
Refs #21564

No backport needed as the PR adds only testcases.

Closes scylladb/scylladb#21566

* github.com:scylladb/scylladb:
  topology_custom/test_tablets.py: add testcase for tablet migration of staged sstables
  topology_custom/test_tablets.py: add testcase for tablet migration with unbuilt views
2024-11-14 08:32:38 +02:00
Lakshmi Narayanan Sreethar
c1d447c932 topology_custom/test_tablets.py: add testcase for tablet migration of staged sstables
Tablet migration mixes staged and non staged sstables causing base view
inconsistencies in the pending replica. Added a testcase to reproduce
this issue.

Refs #19149.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-13 18:17:20 +05:30
Lakshmi Narayanan Sreethar
4cc12e1b7e topology_custom/test_tablets.py: add testcase for tablet migration with unbuilt views
When a tablet gets migrated right after view was created but before the
view builder registered the new view, the pending replica will not
register the sstables in the tablet for view building causing base view
inconsistencies. This commit adds a testcase to reproduce the issue.

Refs #21564

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-13 18:17:14 +05:30
Nadav Har'El
3fda9651cc test/alternator: option to run alternator tests against specific release
We recently added a "--release <version>" option to test/cql-pytest/run
to run a cql-pytest test against a released version of Scylla, downloaded
automatically from ScyllaDB's precompiled binary repository. This patch
adds the same capability also to test/alternator/run - allowing to run
a current test/alternator test on older releases of Scylla. The
implementation in this patch reuses the same implementation from the
cql-pytest patch.

Here is an example use case: the pull request #19941 claimed that
a certain bug fix was backported to release 6.0. Was it? Let's run
the test reproducing that bug on two releases:

test/alternator/run --release 6.0 test_streams.py::test_stream_list_tables
test/alternator/run --release 6.1 test_streams.py::test_stream_list_tables

It shows that the test passes on 6.1 (so the bug is fixed there) but the
test fails 6.0. It turns out that although the fix was backported to
branch-6.0, this happened shortly after 6.0.4 was released and no later
6.0 minor release came afterwards! So the bug wasn't actually fixed
on any official release of 6.0.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21343
2024-11-13 09:38:09 +02:00
Kefu Chai
6d65e1a73c Update seastar submodule
* seastar fba36a3d...1b0a3087 (9):
  > program-options: add missing include <memory>
  > reactor: Always retry waitpid
  > treewide: include core/format.hh when appropriate
  > print: remove unused fmt/ostream.h
  > print: extract format() into format.hh
  > net: route error messages to logger instead of to stderr
  > net: stop printing when reaching unreachable branch
  > reactor: Mark drain() private
  > rpc: optimize tuple deserialization when the types are default-constructible

Closes scylladb/scylladb#21520
2024-11-13 09:33:00 +02:00
Kefu Chai
e0525bbac0 .github: add "message" to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's include
"message" subdirectory to CLEANER_DIR, so that this workflow can
identify the regressions in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-13 14:29:52 +08:00
Kefu Chai
876c4ec78a message: do not include unused headers
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-13 14:29:52 +08:00
Emil Maskovsky
92db2eca0b test/topology_custom: fix the flaky test_raft_recovery_stuck
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.

Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.

Fixes: scylladb/scylladb#19959

Closes scylladb/scylladb#21477
2024-11-12 16:38:28 +01:00
Kefu Chai
45e8d6793e test: include fmt/iostream.h and iostream when appropriate
this change was created in the same spirit of aebb5329, which
included the fmt/iostream.h and iostream when appropriate so that
the tree can build with seastar submodule including e96932b0.
in the seastar change, we stopped including unused `fmt/ostream.h`
in a public header in seastar, so the parent projects relying on
the header to indirectly include fmt/ostream.h and iostream would
have to include these headers explicitly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21525
2024-11-12 17:34:08 +02:00
Yaron Kaikov
3bc2b34a18 ./github/scripts/label_promoted_commits.py: fix search for closes prefix on merge PRs
In cc71077e33, i have added check for the
last line in pr body looking for `closes` prefix.

It seems that this is wrong, since in a merge PR, the `closes` prefix is
not the last line

Instead, changing the search for the last line contains `closes` prefix

Closes scylladb/scylladb#21545
2024-11-12 13:56:37 +02:00
Botond Dénes
1c212df62d Merge 'scylla_raid_setup: fix failure on SELinux package installation' from Takuya ASADA
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.

Fixes https://github.com/scylladb/scylladb/issues/21441

Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.

Fixes https://github.com/scylladb/scylladb/issues/21351

Closes scylladb/scylladb#21474

* github.com:scylladb/scylladb:
  scylla_raid_setup: support installing semanage on Amazon Linux 2
  scylla_raid_setup: fix failure on SELinux package installation
2024-11-12 09:20:56 +02:00
Nikos Dragazis
70d6b445a5 test: Add test for compaction with corrupted SSTables
In the previous patch we enabled integrity checking on all compaction
types. This means that compaction jobs should now fail if they encounter
an SSTable with an invalid checksum or digest.

Add a test to verify this behavior. Test every compaction type with:

* compressed/uncompressed SSTables with invalid checksums
* compressed/uncompressed SSTables with invalid digests

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 22:25:49 +02:00
Nikos Dragazis
6687eba2db compaction: Enable integrity checks for all compaction types
Compaction tasks create mutation readers to read SSTables from disk.
Each compaction type defines its own reader creation logic by
implementing the pure virtual function `compaction::make_sstable_reader()`.

Modify all implementations of `make_sstable_reader()` to enable
integrity checking on the created readers. This way, all compaction
tasks will be able to detect corruption issues on the compacting
SSTables.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 22:25:45 +02:00
Nikos Dragazis
70dd124a95 sstables: Add integrity option to factories for sstable_set readers
Expose the integrity option of the sstable reader factories to the
corresponding sstable_set factories, namely:

* `sstable_set::make_local_shard_sstable_reader()`
* `sstable_set::make_full_scan_reader()`
* `sstable_set::make_range_sstable_reader()`

This is needed to support integrity checking in compaction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:42:46 +02:00
Nikos Dragazis
a8f65a421b sstables: Add integrity option to sstable::make_reader()
Expose the integrity option of the mx reader via the public factory
method `sstable::make_reader()`. Same flag is offered for full-scan
readers via `sstable::make_full_scan_reader()`.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:40:31 +02:00
Nikos Dragazis
64688fdad6 sstables: Add integrity option to mx::make_reader()
In previous patch we added support for integrity checking in the mx
full-scan reader.

Do the same for the mx reader, which is the one used by all compaction
types except for scrub compaction. The mx reader should now support
integrity checking for single-partition and multi-partition reads.
Single-partition reversed reads were excluded from this patch because
they are not used in compaction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:40:30 +02:00
Nikos Dragazis
1993aa5261 sstables: Load checksums and digests in mx full-scan reader
In 716fc487fd we introduced integrity checking in the mx crawling reader
(later renamed to full-scan reader in 6250ff18eb).

When integrity checking is enabled, the full-scan reader expects that
the checksum and digest components have been loaded from disk by the
caller. This is true for the validation path, in which
`sstable::validate()` loads the components before creating the full-scan
reader, but it doesn't hold if a full-scan reader is created directly by
a higher-level function through `sstable::make_full_scan_reader()`.

As part of the effort to enable integrity checking for compaction, this
becomes a blocker for scrub compaction, which relies solely on full-scan
readers.

Solve this by allowing the mx full-scan reader to load the checksum and
digest components internally. The loading is an asynchronous operation,
so it has to be deferred until the first buffer fill.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
609b16307e sstables: Add integrity option to data_consume_single_partition()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
5b896cdbb7 sstables: Disengage integrity_check from sstable class
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.

However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.

Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
2cc82f64e8 sstables: Allow data sources to disable digest check
The compressed and checksummed data sources offer digest checking as an
optional feature. It can be enabled via the boolean template parameter
`check_digest`. If enabled, the data sources calculate the actual digest
chunk-by-chunk whenever `get()` is called, and compare with the expected
digest when all data have been read. If the actual digest cannot be
calculated due to a partial read or skip, the data sources treat this
condition as an internal error.

Relax this constraint by allowing the data sources to handle digest
checks as best effort, i.e., continue to operate with digest checking
disabled if the actual digest cannot be calculated.

We will use this in later patches to enable digest checking for
compaction. Compaction can cause both partial reads and skips (e.g., in
case of cleanup compaction) and we cannot predict skips beforehand.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Pavel Emelyanov
b158ca7346 api: Remove param field from req_param
The req_param class is used to help parsing http request parameters from
strings into exact types (typically some simple types like strings,
integrals or boolean). On it there are three fields:

- name -- the parameter name
- param -- the parameter string value
- value -- the parameter value of desired type

The `param` thing is not really needed, it's only used by few places
that print it into logs, but they may as well just print the `value`
thing itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21502
2024-11-11 17:47:55 +02:00
Pavel Emelyanov
87ec2af6f0 api: Remove dead if-branch that collects all tables from ks
After calling api::parse_tables() the resulting vector of table names
cannot be empty, because in case parameter is missing, the parse_tables
function returns all tables from keyspace anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21501
2024-11-11 17:46:38 +02:00
Botond Dénes
30cb58b3e4 Merge 'compaction: use better partition estimate for split compaction' from Lakshmi Narayanan Sreethar
Split compaction divides the partitions in an existing sstable into two groups and writes them into two new sstables, which replace the original one. The partition count from the original sstable is used as an estimate when writing the new ones, but this estimate is not accurate as the partitions are split between the two new sstables and each will contain only a portion of the original partition count. This also causes the bloom filters to be rebuilt at the end of compaction, as they were initially built with inaccurate estimates.

Fix this by using a better estimate for the output sstables, which is half the original partition count.

Fixes #20253

Improvement; No need to backport.

Closes scylladb/scylladb#20908

* github.com:scylladb/scylladb:
  compaction: use better partition estimate for split compaction
  compaction::table_state: implement `get_token_range_after_split()` wrapper
  replica/table: implement `get_token_range_after_split()` wrappers
  tablet_map: introduce `get_token_range_after_split()`
  tablet_map: implement existing get_token_range() using the new variant
  tablet_map: introduce `get_token_range()` variant
  tablet_map: introduce `get_last_token()` variant
2024-11-11 16:25:08 +02:00
Kefu Chai
3fb1112c18 readers/multishard: fix a typo in comment
s/fullfill/fulfill/

this misspelling was identified by the codespell workflow.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21521
2024-11-11 16:14:47 +02:00
Kefu Chai
88410b75c9 test/object_store: verify backup fails on missing snapshot
Add test to ensure backup tasks properly handle non-existent snapshots
by:

- Verifying backup task reports failure status
- Ensuring error is propagated through task status API

Previously untested edge case when backing up a snapshot that doesn't
exist in the test_backup.py tests.

Refs scylladb/scylladb#21381
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21385
2024-11-11 13:50:07 +03:00
Yaron Kaikov
cc71077e33 .github/scripts/label_promoted_commits.py: only match the Close tag in the last line in the commit message
When a backport PR is promoted to the release branch, we automatically close the backport PR (since GitHub will only close the one based on the default branch) and update the labels in the original PRs

In a situation when we have multiple `closes` prefixes, the script will use the first one (which is not the correct one), see 3ddb61c90e

Fixing this by always using the last line with the `closes` prefix

Closes scylladb/scylladb#21498
2024-11-11 11:04:33 +02:00
Dani Tweig
381faa2649 Rename .github/ISSUE_TEMPLATE.md to .github/ISSUE_TEMPLATE/bug_report.yml
GitHub issue template process has changed.
The issue template file should be replaced and renamed.

Closes scylladb/scylladb#21518
2024-11-11 11:00:38 +02:00
Takuya ASADA
6fe09a5a16 scylla_raid_setup: support installing semanage on Amazon Linux 2
Since Amazon Linux 2 has different package name for semange, we need to
adjust package name.

Fixes #21351
2024-11-11 17:27:24 +09:00
Takuya ASADA
7ad5e69c54 scylla_raid_setup: fix failure on SELinux package installation
After merged 5a470b2, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.

Fixes #21441
2024-11-11 17:27:24 +09:00
Nikita Kurashkin
3032d8ccbf add check to refuse usage of DESC TABLE on a materialized view
Fixes #21026

Closes scylladb/scylladb#21500
2024-11-11 10:23:30 +02:00
Yaron Kaikov
2596d1577b ./github/workflows/add-label-when-promoted.yaml: Run auto-backport only on default branch
In https://github.com/scylladb/scylladb/pull/21496#event-15221789614
```
scylladbbot force-pushed the backport/21459/to-6.1 branch from 414691c to 59a4ccd Compare 2 days ago
```

Backport automation triggered by `push` but also should either start from `master` branch (or `enterprise` branch from Enterprise), we need to verify it by checking also the default branch.

Fixes: https://github.com/scylladb/scylladb/issues/21514

Closes scylladb/scylladb#21515
2024-11-11 09:16:35 +02:00
Lakshmi Narayanan Sreethar
eb4b407085 compaction: use better partition estimate for split compaction
Split compaction divides the partitions in an existing sstable into two
groups and writes them into two new sstables, which replace the original
one. The partition count from the original sstable is used as an
estimate when writing the new ones, but this estimate is not accurate as
the partitions are split between the two new sstables and each will
contain only a portion of the original partition count. This also causes
the bloom filters to be rebuilt at the end of compaction, as they were
initially built with inaccurate estimates.

Fix this by using a better estimate for the output sstables based on the
token ranges written to them.

Fixes scylladb#20253

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:26:51 +05:30
Lakshmi Narayanan Sreethar
67dad99ab5 compaction::table_state: implement get_token_range_after_split() wrapper
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:24:00 +05:30
Lakshmi Narayanan Sreethar
c4db4abcae replica/table: implement get_token_range_after_split() wrappers
Expose the functionality of `tablet_map::get_token_range_after_split()`
via the replica::table class.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:24:00 +05:30
Lakshmi Narayanan Sreethar
4130187e78 tablet_map: introduce get_token_range_after_split()
Added `get_token_range_after_split()`, which returns the token range the
given token will belong to after a tablet split. This is required to
estimate the token ranges of resultant sstables after a split.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:23:47 +05:30
Lakshmi Narayanan Sreethar
1e2c1d7f25 tablet_map: implement existing get_token_range() using the new variant
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:22:14 +05:30
Lakshmi Narayanan Sreethar
f536c7d15b tablet_map: introduce get_token_range() variant
Implement `get_token_range()` to return the token range of the specified
tablet with the given `log2_tablets` size. This will be used to deduce
which range a token will end up in if the tablet is split.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:21:05 +05:30
Lakshmi Narayanan Sreethar
f655136091 tablet_map: introduce get_last_token() variant
Implement `get_last_token()`, which returns the largest token owned
by the specified tablet with the given `log2_tablets` size. This will be
used to deduce token ranges for a tablet with any arbitrary
`tablet_count`.

Also, update the existing public `get_last_token()` to utilize the new
variant.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:18:04 +05:30
Pavel Emelyanov
57af69e15f Merge 'Add retries to the S3 client' from Ernest Zaslavsky
1. Add `retry_strategy` interface and default implementation for exponential back-off retry strategy.
2. Add new S3 related errors, also introduce additional errors to describe pure http errors that has no additional information in the body.
3. Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html. Introduce `aws_exception` to carry the original `aws_error`.
4. Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy.
5. Add tests to cover retries, and retry exhaustion. Also add tests for jumbo upload.
6. Add the S3 proxy which is used to randomly inject retryable S3 errors to test the "retry" part of the S3 client. Switch the `s3_test` to use the S3 proxy. `s3_tests` set afloat `put_object` problem that was causing segmentation when retrying, fixed.
7. Extend the `s3_test` to use both `minio` and `proxy` configurations.
8. Add parameter to the proxy to seed the error injection randomization to make it replayable.

fixes: #20611
fixes: #20613

Closes scylladb/scylladb#21054

* github.com:scylladb/scylladb:
  aws_errors: Make error messages more verbose.
  test: Make the minio proxy randomization re-playable
  test/boost/s3_test: add error injection scenarios to existing test suite
  test: Switch `s3_test` to use proxy
  test: Add more tests
  client: Stop returning error on `DELETE` in multipart upload abortion
  client: Fix sigsegv when retrying
  client: Add retries
  client: Adjust `map_s3_client_exception` to return exception instance
  aws_errors: Change aws_error::parse to return std::optional<>
  aws_errors: Add http errors mapping into aws_error
  client: Add aws_exception mapping
  aws_error: Add `aws_exeption` to carry original `aws_error`
  aws_errors: Add new error codes
  client: Introduce retry strategy
2024-11-11 08:35:55 +03:00
Takuya ASADA
92af373fab unified: drop scylla-tools from unified package
On b8634fb, we dropped scylla-tools from rpm and deb, we should drop it
from unified package as well.

Closes #20739

Closes scylladb/scylladb#20740
2024-11-10 12:56:43 +02:00
Avi Kivity
b58dbe57aa Merge 'repair: introduce and use buffer size hint for mixed-shard multishard reader' from Botond Dénes
Add a buffer hint to the multishard reader. This is an internal hint, used by the multishard reader to provide a hint to the shard reader, on how much data exactly is needed by the multishard reader from the respective shard. This hint allows eliminating extraneous cross-shard round-trips and possible shard reader evict-recreate cycles. Building on this, repair sets its own row buffer size as the max buffer size on the multishard reader, ensuring that the row buffer is filled with the minimum amount of cross-shard round trips and minimal reader recreation.
To further eliminate unnecessary evictions, this PR also disables the multishard reader's read-ahead which is a mechanism that was designed to reduce latency for user-reads but it can be too aggressive for repair, causing unnecessary extra congestion on the already struggling streaming semaphores.

Refs: https://github.com/scylladb/scylladb/issues/18269
Fixes: https://github.com/scylladb/scylladb/issues/21113

The performance impact was measured with an SCT test, which creates a cluster of 3 nodes with 16 shards, then adds a 4th one with 12 shards.
Currently, it is the bootstrap time which is the worse in the case of mixed shard clusters, see below for the improvement measured during bootstrap:

|              | master        | buffer-hint   | metric                                              |
| ------------ | ------------- | ------------- | --------------------------------------------------- |
| evictions    |          0.9M |         93.0K | scylla_database_paused_reads_permit_based_evictions |
| read (bytes) |          9.0T |          3.9T | scylla_reactor_aio_bytes_read                       |
| read (ops)   |         88.0M |         33.5M | scylla_reactor_aio_reads                            |
| time         |         56min |         20min | N/A                                                 |

This is a performance improvement, no backport required.

Closes scylladb/scylladb#20815

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_test: add test for multishard reader buffer hint
  repair/row_level: disable read-ahead
  db/config: introduce repair_multishard_reader_enable_read_ahead
  readers/multishard: implement the read_ahead flag
  replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter
  readers/multishard: add read_ahead parameter
  repair/row_level: set max buffer size on multishard reader
  replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter
  db/config: introduce enable_repair_multishard_reader_buffer_hint
  readers/multishard: multishard_reader: pass hint to shard_reader
  readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint
  readers/multishard: propagate fill_buffer_hint to shard_reader:fill_reader_buffer()
  readers/multishard: shard_reader: extract buffer-fill into its own method
2024-11-10 12:55:19 +02:00
Kefu Chai
961a53f716 dist: systemd: use default KillMode
before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,

> If set to process, only the main process itself is killed (not recommended!).

and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".

in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.

in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.

Refs scylladb/scylladb#21507

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21508
2024-11-09 20:07:11 +02:00
Kefu Chai
1f940d56b2 build: cmake: s/idle_compiler/idl_compiler/
before this change, the header files generated with `idl-compiler.py`
are not regenerated if `idl-compiler.py` is updated. but they should,
as the change to the script could in turn change the generated header
files. because we have a typo in the `DEPENDS` argument,
`${idle_compiler}` is expanded to an empty string.

in this change, the typo is corrected, and the dependency from the
generated headers to the script is correctly reflected in the building
rules.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21475
2024-11-09 20:06:23 +02:00
Piotr Dulikowski
7021efd6b0 Merge 'main,cql_test_env: start group0_service before view_builder' from Michał Jadwiszczak
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependant on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.

This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.

Fixes scylladb/scylladb#20772

scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.

Closes scylladb/scylladb#21471

* github.com:scylladb/scylladb:
  test/boost/secondary_index_test: add test for use after free
  api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
  main,cql_test_env: start group0_service before view_builder
2024-11-08 20:27:09 +01:00
Kefu Chai
aebb532906 bytes, utils: include fmt/iostream.h and iostream when appropriate
in seastar e96932b05f394b27cd0101e24f0584736795b50f, we stopped
including unused `fmt/ostream.h`. this helped to reduce the header
dependency. but this also broke the build of scylladb, as we rely
on the `fmt/ostream.h` indirectly included by seastar's header project.

in this change, we include `fmt/iostream.h` and `iostream` explictly
when we are using the declarations in them. this enables us to

- bump up the seastar submodule
- potentially reduce the header dependency as we will be able to
  include seastar/core/format.hh instead of a more bloated
  seastar/core/print.hh after bumping up seastar submodule

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21494
2024-11-08 16:43:25 +03:00
Michał Jadwiszczak
f998f027a2 test/boost/secondary_index_test: add test for use after free
Reproduces scylladb/scylladb#20772.

Add error injection to `raft_server_with_timeouts::read_barrier`,
which does 1s sleep before doing the read barrier.
2024-11-08 14:16:19 +01:00
Michał Jadwiszczak
de7b58e8d4 api/raft: use get_server_with_timeouts().read_barrier() in coroutines
It is unsafe to do `get_server_with_timeouts().read_barrier()` in
continuations because `get_server_with_timeouts()` returns
raft server by value and it may be deallocated when `read_barrier()` yields,
causing use-after-return.

Simple workaround is to use the read barrier in coroutine and co_await
it. Then the raft server is kept on stack until the read barrier is
finished.

I've checked all codebase and it looks like the only place where
`group0_with_timeouts().read_barrier()` is in continuation, is
api/raft.cc.

Co-authored-by: Piotr Dulikowski <piodul@scylladb.com>
2024-11-08 14:15:13 +01:00
Botond Dénes
e3e8a94c9a Merge 'Allow explicitly enabling or disabling tablets when creating a new keyspace' from Benny Halevy
Separate the configuration for enabling the tablets feature from the enablement of tablets when creating new keyspaces.

This change always enables the TABLETS cluster feature and the tablets logic respectively.

The `enable_tablets` config option just controls whether tablets are enabled or disabled by default for new keyspaces.

If `enable_tablets` is set to `true`, tablets can be disabled using `CREATE KEYSPACE WITH tablets = { 'enabled': false }` as it is today.

If `enable_tablets` is set to `false`, tablets can be enabled using `CREATE KEYSPACE WITH tablets = { 'enabled': true }`.

The motivation for this change is to simplify the user experience of using tablets by setting the default for new keyspaces to false amd allowing the user to simply opt-in by using tablets = {enabled: true }.
This is not pissible today.
The user has to enable tablets by default for all new keyspaces (that use the NetworkTopologyStrategy) and then actively opt-out to use vnodes.

* Not required to be backported to OSS versions.  May be backported to specific enterprise versions

* This PR resubmits https://github.com/scylladb/scylladb/pull/20729 that was reverted in 73b1f66b70 due to https://github.com/scylladb/scylladb/issues/21159 which is now fixed

Closes scylladb/scylladb#21451

* github.com:scylladb/scylladb:
  data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
  tablets_test: test enable/disable tablets when creating a new keyspace
  treewide: always allow tablets keyspaces
  feature_service: prevent enabling both tablets and gossip topology changes
  alternator: create_keyspace_metadata: enable tablets using feature_service
2024-11-08 09:15:42 +02:00
Michał Chojnowski
35921eb67e mvcc_test: fix a benign failure of test_apply_to_incomplete_respects_continuity
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.

For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.

This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.

Fixes scylladb/scylladb#13757

Closes scylladb/scylladb#21459
2024-11-08 06:08:39 +01:00
Ernest Zaslavsky
029837a4a1 aws_errors: Make error messages more verbose.
Add more information to the error messages to make the failure reason clearer. Also add tests to check exceptions propagated from s3 client failure.
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
14f3832749 test: Make the minio proxy randomization re-playable
Provide a seed to the proxy randomization, the idea that the `test.py` will initialize the seed from `/dev/urandom` and print the seed when starting, in case some tests failed the dev is supposed to re-play it locally with the same seed (if it didnt repro otherwise) using the `start_s3_proxy.py` and providing it with the aforementioned seed using `--rnd-seed` command line argument
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
0c62635f05 test/boost/s3_test: add error injection scenarios to existing test suite
Add variants of existing S3 tests that route through a proxy instead of connecting directly to MinIO. The proxy allows injecting errors to validate error handling and recovery mechanisms under failure conditions.
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
8919e0abab test: Switch s3_test to use proxy
Switch `s3_test` to use the S3 proxy which is used to randomly inject retryable S3 errors to test the "retry" part of the S3 client.
Fix `put_object` to make it retryable
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
b1e36c868c test: Add more tests
Add tests to cover retries, and retry exhaustion. Also add tests for jumbo upload.
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
7fd1ff8d79 client: Stop returning error on DELETE in multipart upload abortion
Discard whatever exception is thrown in `abort_upload` when aborting multipart upload since we don't care about cleanly aborting it since there are other means to clean up dangling parts, for example `rclone cleanup` or S3 bucket's Lifecycle Management Policy
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
064a239180 client: Fix sigsegv when retrying
Stop moving the `file` into the `make_file_input_stream` since it will try to use it again on retry
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
dc6e4c0d97 client: Add retries
Add retries to the s3 client, all retries are coordinated by an instance of `retry_strategy`. In a case of error also parse response body in attempt to retrieve additional and more focused error information as suggested by AWS. See https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html.

Also move the expected http status check to the `make_s3_error_handler` since the http::client::make_request call is done with `nullopt` - we want to manage all the aws errors handling in s3 client to prevent the http client to validate it and fail before we have a chance to analyze the error properly
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
244635ebd8 client: Adjust map_s3_client_exception to return exception instance
"Unfuturize" the `map_s3_client_exception` since the retryable client is going to be implemented using coroutines and no `future` is needed here, just to save unnecessary `co_await` on it
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
bd3d4ed417 aws_errors: Change aws_error::parse to return std::optional<>
Change aws_error::parse to return std::optional<> to signify that no error was found in the response body
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
58decef509 aws_errors: Add http errors mapping into aws_error
Add http errors mapping into aws_error since the retry strategy is going to operate on aws_error and should not be aware of HTTP status codes
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
fa9e8b7ed0 client: Add aws_exception mapping
Map aws_exceptions in `map_s3_client_exception`, will be needed in retryable client calls to remap newly added  AWS errors to `storage_io_error`
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
54e250a6f1 aws_error: Add aws_exeption to carry original aws_error
Add `aws_exeption` to carry original `aws_error` for proper error handling in retryable s3 client
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
e6ff34046f aws_errors: Add new error codes
Add new S3 related errors, also introduce additional errors to describe pure http errors that has no additional information in the body
2024-11-07 21:01:25 +02:00
Ernest Zaslavsky
8dbe351888 client: Introduce retry strategy
Add `retry_strategy` interface and default implementation for exponential back-off retry strategy
2024-11-07 21:01:25 +02:00
Michał Jadwiszczak
7bad8378c7 main,cql_test_env: start group0_service before view_builder
In scylladb/scylladb#19745, view_builder was migrated to group0
and since then it is dependent on group0_service.
Because of this, group0_service should be initialized/destroyed
before/after view_builder.

Fixes scylladb/scylladb#20772

Co-authored-by: Dawid Mędrek <dawid.medrek@scylladb.com>
2024-11-07 14:08:11 +01:00
Kamil Braun
c268cf2e33 Merge 'test: rename "cql-pytest" to "cqlpy"' from Nadav Har'El
Python and Python developers don't like directory names to include a  minus sign, like "cql-pytest". In this patch we rename test/cql-pytest to test/cqlpy, and also change a few references in other code (e.g., code that used test/cql-pytest/run.py) and also references to this test suite in documentation and comments.

Arguably, the word "test" was always redundant in test/cql-pytest, and I want to leave the "py" in test/cqlpy to emphasize that it's Python-based tests, contrasting with test/cql which are CQL-request-only approval tests.

The second patch in the series fixes a small regression in the test/cqlpy/run script.

Fixes #20846

Test organization only, so backports not strictly necessary, but let's do them anyway because otherwise it will make any future backporting of tests in the cqlpy directory more messy than it needs to be.

Closes scylladb/scylladb#21446

* github.com:scylladb/scylladb:
  test/cqlpy: fix "run" script without any parameters
  test: rename "cql-pytest" to "cqlpy"
2024-11-07 13:26:07 +01:00
Benny Halevy
40928bd886 data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
Now that tablets may be explicitly enabled when
creating a new keyspace, describe tablets as enabled
even when the default initial_tablets==0 is used.

Refs https://github.com/scylladb/scylla-enterprise/issues/4860

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-07 13:59:59 +02:00
Benny Halevy
8620d9f672 tablets_test: test enable/disable tablets when creating a new keyspace
Test both configuration values for `enable_tablets`
and the possibility to explicitly enable or disable
tablets, respectively, when creating a keyspace using the
`tablets = {'enabled': true|false}` CREATE KEYSPACE option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-07 13:57:40 +02:00
Benny Halevy
4b21cca443 treewide: always allow tablets keyspaces
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.

Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`

Note: best viewed with `git show -w`

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-07 13:57:39 +02:00
Benny Halevy
974b0f2080 feature_service: prevent enabling both tablets and gossip topology changes
Tablets require raft consistent topology changes.
Therefore, document that they are incompatible in
the config help and prevent their usage in
`feature_config_from_db_config`

Fixes scylladb/scylladb#21075

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-07 13:56:59 +02:00
Benny Halevy
4cf3b683bc alternator: create_keyspace_metadata: enable tablets using feature_service
Rather than using the local configuration option
on this node, check the cluster feature instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-07 13:56:59 +02:00
Botond Dénes
e21346179c test/boost/mutation_reader_test: add test for multishard reader buffer hint 2024-11-07 02:47:54 -05:00
Botond Dénes
5c5c77746e repair/row_level: disable read-ahead
The multishard reader's read-ahead was designed to reduce the latency of
range scans. But in the case of repair, read-ahead is suspected to
contribute significant extra load on the congested streaming semaphore
and thus contribute to the subsequent trashing (excessive reader
eviction).
First off, read-ahead was designed with pages of limited size in mind.
Repair can read much more, even for a single repair buffer. This can
lead to read-ahead concurrency to continue ramping up, creating and
using more and more readers.
Secondly, repair is not latency sensitive, so even when working well and
there is no congestion, the benefits are negligible.

The use of read-ahead is now controllable by the new
repair_multishard_reader_enable_read_ahead config item, defaulting to
false.
2024-11-07 02:47:54 -05:00
Botond Dénes
a248520201 db/config: introduce repair_multishard_reader_enable_read_ahead
Not used yet.
2024-11-07 02:47:54 -05:00
Botond Dénes
36a8756028 readers/multishard: implement the read_ahead flag
Don't do read-aheads when read-ahead was not enabled.
2024-11-07 02:47:54 -05:00
Botond Dénes
8938e06ebe replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter
Continuing the previous patch, expose the just added read_ahead
parameter of make_multishard_combining>_reader_v2().
Set to read_ahead::yes by all callers, keeping the current default.
2024-11-07 02:47:54 -05:00
Botond Dénes
c6c62deaa5 readers/multishard: add read_ahead parameter
And propagate to the reader itself. Not used yet.
2024-11-07 02:47:54 -05:00
Botond Dénes
784f89f585 repair/row_level: set max buffer size on multishard reader
The multishard reader is used in the mixed-shard case, when a repair has
to read from all other shards. It is very important that cross-shard
roundtrips and possible evict-recreate cycles for the shard readers is
avoided. For this end, make use of the recently introduced internal
buffer hint feature in the multishard reader and set it's buffer size to
match that of the row level repair buffer size.

The use of the buffer-hint can be controlled with the recently
introduced repair_multishard_reader_buffer_hint_size config param.
2024-11-07 02:47:54 -05:00
Botond Dénes
e2344e28b6 replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter
Expose the buffer hint functionality added by the previous commits, to
callers of make_multishard_streaming_reader(). All callers disable it
currently, it will be used in the next patch.
2024-11-07 02:47:46 -05:00
Yaron Kaikov
ef104b7b96 .github/scripts/auto-backport.py: update method to get closed prs
`commit.get_pulls()` in PyGithub returns pull requests that are directly associated with the given commit

Since in closed PR. the relevant commit is an event type, the backport
automation didn't get the PR info for backporting

Ref: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21468
2024-11-07 09:28:46 +02:00
Avi Kivity
9e67649fe5 utils: loading_cache: tighten clock sampling
Sample the clock once to avoid the filter returning different results.

Range algorithms may use multiple passes, so it's better to return
consistent results.

Closes scylladb/scylladb#21400
2024-11-07 10:28:01 +03:00
Kefu Chai
50fbab29ca compaction: remove unused "#include"
we don't use `std::list` in compaction/compaction_manager.hh, neither
is this header responsible for exposing the declarations in `<list>`.
so let's stop `#include` this header.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21436
2024-11-07 10:25:27 +03:00
Avi Kivity
f5489ba4a1 locator: tablet_metadata_guard: forward declare database
No need to bring in a heavy databas.hh dependency.

Closes scylladb/scylladb#21447
2024-11-07 10:24:35 +03:00
Kefu Chai
ba021f72a6 api: s/mulformatted/malformatted
mulformatted was a typo, let's fix it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21442
2024-11-07 10:07:11 +03:00
Pavel Emelyanov
49949092ad Merge 'Make s3 client ops use abort source + use in backup task' from Calle Wilund
Fixes #20716

Adds optional abort_source to all s3 client operations. If provided, will propagate to actual HTTP client and allow for aborting actual net op.

Note: this uses an abort source per call, not a client-local one.
This is for two reasons:

1.) The usage pattern of the client object is to create it outside the eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for some operation usage.

Also adds forward usage of task abort_source in backup tasks upload s3 call, making it more readily abort-able.

Closes scylladb/scylladb#21431

* github.com:scylladb/scylladb:
  backup_task: Use task abort source in s3 client call
  s3::client: Make operations (individually) abortable
2024-11-07 10:03:25 +03:00
Yaron Kaikov
9d8562caf3 Add conflict_reminder action for backport PR
In order not to forget to resolve conflicts in backport PRs, we should add some reminders to the PR author so it will not be forgotten

the new action will run twice a week and will send a reminder only for
PR opened with conflicts for 3 days or more

Fixes: https://github.com/scylladb/scylladb/issues/21448

Closes scylladb/scylladb#21449
2024-11-07 06:55:37 +02:00
Calle Wilund
0db4b9fd94 backup_task: Use task abort source in s3 client call
Fixes #20716

Propagates abort source in task object to actual network call,
thus making the upload workload more quickly abortable.

v2: Fix test to handle two versions after each other
2024-11-06 15:20:23 +00:00
Nadav Har'El
1fd7b797c7 test/cqlpy: fix "run" script without any parameters
A recent improvement to test/cqlpy/run to add the "--release" option
broke the ability to run this script it without *any* options (no test
name, etc.). This patch fixes this case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-06 16:48:36 +02:00
Nadav Har'El
8c215141a1 test: rename "cql-pytest" to "cqlpy"
Python and Python developers don't like directory names to include a
minus sign, like "cql-pytest". In this patch we rename test/cql-pytest
to test/cqlpy, and also change a few references in other code (e.g., code
that used test/cql-pytest/run.py) and also references to this test suite
in documentation and comments.

Arguably, the word "test" was always redundant in test/cql-pytest, and
I want to leave the "py" in test/cqlpy to emphasize that it's Python-based
tests, contrasting with test/cql which are CQL-request-only approval
tests.

Fixes #20846

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-06 16:48:36 +02:00
Kefu Chai
6efde20939 utils/to_string: do not include fmt/ostream.h
to_string.hh does not use this header, neither is it obliged to
expose the content of this header. so, let's remove this include.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21440
2024-11-06 17:21:29 +03:00
Botond Dénes
3c25e6fcb4 db/config: introduce enable_repair_multishard_reader_buffer_hint
Allows enabling/disabling the multishard reader buffer hint
optimization. Not wired yet.
2024-11-06 08:51:00 -05:00
Botond Dénes
b052c5df62 readers/multishard: multishard_reader: pass hint to shard_reader
Calculate a buffer fill hint and pass it to
shard_reader_v2::fill_buffer(), so the underlying buffer-fill can be
optimized to avoid multiple cross shard round-trips, as well as possible
evict-recreate cycles.
The buffer hint mechanism is opt-in, enabled via the new
multishard_reader_buffer_hint parameter.
2024-11-06 08:51:00 -05:00
Botond Dénes
912b4dfba3 readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint
When the hint is provided, respect it: make sure the returned buffer is
of the requested size, stopping early if the stop_token is seen.
To reduce the amount of possible eviction-recreate cycles while the
buffer is filled, disable auto-pause for the duration of the
fill_reader_buffer() call. For this purpose, auto_pause_disable_guard is
added to evictable_reader_v2.
2024-11-06 08:51:00 -05:00
Botond Dénes
8d5283f036 readers/multishard: propagate fill_buffer_hint to shard_reader:fill_reader_buffer()
The hint will tell the shard reader exactly how much data to produce, to
avoid multiple cross-shard round-trips and possible evict-recreate
cycles.

The hint is neither used yet or calculated yet, this is coming in the
next patches.
2024-11-06 08:51:00 -05:00
Botond Dénes
ee7ecb9155 readers/multishard: shard_reader: extract buffer-fill into its own method
It is about to get a bit more complicated, so worth to extract into a
method so it can be shared by the two call-sites.
2024-11-06 08:51:00 -05:00
Tomasz Grabiec
f7d35d535e Merge 'bytes_ostream: replace boost ranges with std ranges' from Avi Kivity
To reduce the dependency load, replace boost ranges with std::ranges.

Cleanup; no backport.

Closes scylladb/scylladb#21450

* github.com:scylladb/scylladb:
  bytes_ostream: replace boost ranges with std ranges
  bytes_ostream: extract fragment_iterator into namespace scope
2024-11-06 14:01:27 +01:00
Yaron Kaikov
77604b4ac7 .github/script/auto-backport.py: push backport PR to scylladbbot fork
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).

When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:

- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted

Fixes: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21401
2024-11-06 14:29:37 +02:00
David Garcia
a072478f4f docs: enable tooltips
Updates the theme to the latest version to enable tooltips and modifies the db_options.tmpl to show the new role in action.

Closes scylladb/scylladb#21324
2024-11-06 14:09:28 +02:00
Andrei Chekun
afd1fc8e9f test.py: Add pytest-xdist to the toolchain
Add new dependency pytest-xdist to the toolchain. This will allow executing
boost and unit tests from pytest in parallel, reducing the time
needed for the run.

Closes scylladb/scylladb#21222
2024-11-06 14:09:01 +02:00
Botond Dénes
0ad32c153d Merge 'test_tablets: add rack decommission test cases' from Benny Halevy
test_tablets: add rack decommission test cases

Test scenarios where decommissioing a compelte rack
should succeed, and reproduce scylladb/scylladb#19475
where decommissioning a rack would fail since the
number of remaining racks is insufficient to satisfy
the replication factor, even though the number of nodes
is sufficient, enshrining this behavior.

Refs scylladb/scylladb#19475

* This PR adds unit tests and improves an error message.  No backport required.

Closes scylladb/scylladb#20747

* github.com:scylladb/scylladb:
  tablet_allocator: improve error message when unable to find replicas when draining
  test_tablets: add rack decommission test cases
  topology_experimental_raft/test_tablets: get_tablet_count_per_shard_for_host: move shards_count param to be last
  test/pylib: ServerInfo: add datacenter and rack attributes
  test: everywhere: drop unused imports of ServerInfo
2024-11-06 14:07:47 +02:00
Calle Wilund
3321820c67 s3::client: Make operations (individually) abortable
Refs #20716

Adds optional abort_source to all s3 client operations. If provided, will
propagate to actual HTTP client and allow for aborting actual net op.

Note: this uses an abort source per call, not a client-local one.
This is for two reasons:

1.) The usage pattern of the client object is to create it outside the
    eventual owning object (task) that hosts the relevant abort source
2.) It is quite possible to want to have different/no abort source for
    some operation usage.
2024-11-05 14:23:24 +00:00
Avi Kivity
8fb6d98ba3 Merge "various gossiper code cleanups" from Gleb
* 'gleb/gossip-cleanup-v3' of github.com:scylladb/scylla-dev:
  gossiper: start failure_detector_loop on shard 0 only
  gossiper: use 1 seconds instead of 1000 milliseconds
  gossiper: remove unused code
  gossiper: co-routinize do_send_ack2_msg
  gossiper: do not needlessly call get_endpoint_state_ptr in handle_major_state_change
  gossiper: fix weird logic in get_live_members
  gossiper: drop unneeded this->
  gossiper: fold get_or_create_endpoint_state into my_endpoint_state
  gossiper: co-routinize do_send_ack_msg
2024-11-05 15:31:58 +02:00
Avi Kivity
baaa92c6f5 bytes_ostream: replace boost ranges with std ranges
Have fragment_iterator support iterator_concept for compatibility with
std ranges, and switch from boost iterator_range to std::ranges::subrange.
2024-11-05 14:50:38 +02:00
Avi Kivity
cb026c347e bytes_ostream: extract fragment_iterator into namespace scope
C++ concept evaluation rules clash with nested class definition rules
with the result that evaluating concepts about the nested class within
the enclosing class doesn't work. Extract bytes_ostream::fragment_iterator
to avoid that.
2024-11-05 14:43:49 +02:00
Avi Kivity
4dab2473a2 Merge 'treewide: trade boost's any_of and all_of for std's any_of and all_of' from Kefu Chai
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::all_of` and `std::ranges::any_of`

in this change, we replace `boost::algorithm::all_of` and `boost::algorithm::any_of` with
`std::ranges::all_of` and `std::ranges::any_of` respectively.

to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21411

* github.com:scylladb/scylladb:
  treewide: s/boost::algorithm::any_of/std::ranges::any_of/
  treewide: s/boost::algorithm::all_of/std::ranges::all_of/
2024-11-05 12:48:24 +02:00
Piotr Dulikowski
7f17894c88 Merge 'cql3: Allow for describing CDC log tables' from Dawid Mędrek
In the past, DESC SCHEMA would produce create statements for both the base
and the log table. That was incorrect as the log table is automatically
created alongside the base one. That was solved in scylladb/scylladb@9ab57b1
(scylladb/scylladb#18467).

The mentioned changes implemented the following solution:

* DESC SCHEMA/KEYSPACE/TABLE would still print a create statement for the
  CDC base table,
* DESC SCHEMA/KEYSPACE would start printing an alter statement for the
  CDC log table. That statement would ensure that the restored log table
  has the same parameters as the original one,
* DESC TABLE <base table> would behave as DESC SCHEMA/KEYSPACE, i.e.
  it would print a create statement for the base table and an alter
  statement for the log table,
* DESC TABLE <log table> would result in an error.

While that solution was good and behaved correctly in the context of
restoring the schema, it had one flaw: describe statement aren't only
used as a means for producing a backup; they also serve an informative
purpose to learn about the schema, e.g. to learn what parameters a specific
table uses. Because we didn't allow for describing CDC log tables, the user
couldn't look them up directly via a describe statement -- they had to
describe the base table for that.

Attempting to describe a log table ended with an error, e.g.:

```
$ DESC TABLE ks.t_scylla_cdc_log;

ks.t_scylla_cdc_log is a cdc log table and it cannot be described directly. Try `DESC TABLE ks.t` to describe cdc base table and it's log table.
```

In these changes, we allow for describing CDC log tables again. The
semantics of the first three bullets above remains unchanged, but
we impose new behavior for DESC TABLE <log table>:

* When the user executes DESC TABLE <log table>, a create statement
  will be returned, treating the table as if it were a regular one,
* The create statement will be wrapped in CQL comment markers.

The rationale for the second bullet is that although we want to give the
user a means to look into the structure and options of a CDC log table,
the returned statement is not supposed to be ever executed by them. We
want to minimize the risk of that.

An example of the behavior after the change:

```
$ DESC TABLE ks.t_scylla_cdc_log;

/* Do NOT execute this statement! It's only for informational purposes.
   A CDC log table is created automatically when the base is created.

CREATE TABLE ks.t_scylla_cdc_log (
    "cdc$stream_id" blob,
    "cdc$time" timeuuid,
    "cdc$batch_seq_no" int,
    "cdc$end_of_batch" boolean,
    "cdc$operation" tinyint,
    "cdc$ttl" bigint,
    p int,
    PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
) WITH CLUSTERING ORDER BY ("cdc$time" ASC, "cdc$batch_seq_no" ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'enabled': 'false', 'keys': 'NONE', 'rows_per_partition': 'NONE'}
    AND comment = 'CDC log for ks.t'
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '60', 'compaction_window_unit': 'MINUTES', 'expired_sstable_check_frequency_seconds': '1800'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

*/
```

We also extend the developer documentation regarding DESCRIBE
statements on CDC tables.

Fixes scylladb/scylladb#21235

Backport: these changes are an enhancement, so not needed.

Closes scylladb/scylladb#21228

* github.com:scylladb/scylladb:
  docs/dev: Document semantics of describing CDC tables
  cql3: Allow for describing CDC log tables
2024-11-05 10:06:13 +01:00
Pavel Emelyanov
440c1e3e3f error_injection: Remove unused inject(sleep, then invoke) overload
The overload was introduced by a8b14b0227 (utils: add timeout error
injection with lambda), but is only used by the test nowadays.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21377
2024-11-05 09:56:08 +02:00
Yaniv Michael Kaul
4c5e102aee node_exporter: use fewer collectors
Remove unused / less useful collectors by default. While it doesn't seem to reduce memory usage, it may reduce potential performance or security issues in the future.
This is what we are left with (snippet of log when loading node exporter manually with the changed command line):

ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=interrupts
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=netclass
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=netstat
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=os
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=softnet
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=time
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-11-03T15:41:06.855Z caller=node_exporter.go:118 level=info collector=xfs

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Improvement, no need to backport.

Closes scylladb/scylladb#21419
2024-11-05 10:41:09 +03:00
Avi Kivity
b292aeecac replica: query.hh: drop dependency on database.hh
database.hh has large fan-in and therefore can trigger a lot
of recompilations if included. Replace with smaller dependencies.

Closes scylladb/scylladb#21424
2024-11-05 10:40:33 +03:00
Pavel Emelyanov
a98b57212e Merge 'lang, .github: remove unused includes, add more directories to CLEANER_DIR' from Kefu Chai
in this series:

- remove unused `#include` in "lang" subdirectory
- add index and lang to CLEANER_DIR

---

cleanup and improvements in the CI, hence no need to backport.

Closes scylladb/scylladb#21437

* github.com:scylladb/scylladb:
  .github: add index and lang to CLEANER_DIR
  lang: remove unused "#includes"
2024-11-05 10:37:04 +03:00
Kefu Chai
f1d4812ad6 test: lib: rest_client: use isinstance() over type()
in addition to the inheritance support, `isinstance()` is also
the recommended way to check for types by PEP8.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21438
2024-11-05 10:36:31 +03:00
Kefu Chai
59eb2ab119 treewide: s/boost::algorithm::any_of/std::ranges::any_of/
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::any_of`.

in this change, we replace `boost::algorithm::any_of` with
`std::ranges::any_of`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-05 14:06:09 +08:00
Kefu Chai
f8bb1c64f1 treewide: s/boost::algorithm::all_of/std::ranges::all_of/
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::all_of`.

in this change, we replace `boost::algorithm::all_of` with
`std::ranges::all_of`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-05 14:05:24 +08:00
Kefu Chai
e651b6dc69 .github: add index and lang to CLEANER_DIR
also explain why we don't run the cleaner against the "idl" subdirectory.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-05 10:01:04 +08:00
Kefu Chai
ee2a9419b3 lang: remove unused "#includes"
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-05 10:01:04 +08:00
Avi Kivity
ee92784098 serialization: replace boost::type with std::type_identity
Recently, seastar rpc started accepting std::type_identity in addition
to boost::type as a type marker (while labeling the latter with an
ominous deprecation warning). Reduce our depedendency on boost
by switching to std::type_identity.
2024-11-05 00:43:27 +01:00
Avi Kivity
075b13597d serializer: drop dependency on boost ranges
The call to boost::range::for_each is easily replaced with ranged for.

Closes scylladb/scylladb#21422
2024-11-04 17:48:17 +02:00
Gleb Natapov
2dbae78542 gossiper: start failure_detector_loop on shard 0 only
failure_detector_loop does nothing on all other shards.
2024-11-04 17:15:06 +02:00
Gleb Natapov
323b04137d gossiper: use 1 seconds instead of 1000 milliseconds 2024-11-04 17:15:06 +02:00
Gleb Natapov
0cb4c71846 gossiper: remove unused code 2024-11-04 17:15:06 +02:00
Gleb Natapov
0e4f149dee gossiper: co-routinize do_send_ack2_msg 2024-11-04 17:15:06 +02:00
Gleb Natapov
1fbac54fb8 gossiper: do not needlessly call get_endpoint_state_ptr in handle_major_state_change
The code calls for get_endpoint_state_ptr several times instead of using
the result of the first call. Change it.
2024-11-04 17:15:06 +02:00
Gleb Natapov
e2cf93abb9 gossiper: fix weird logic in get_live_members
The code adds a node to a set and then removes it if a condition is met.
Add to the set if the condition is not met instead.  Note that the
original set never has local endpoint (it is only added locally), so the
code is equivalent.
2024-11-04 17:14:55 +02:00
Benny Halevy
caedcf20c6 tablet_allocator: improve error message when unable to find replicas when draining
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:54:54 +02:00
Benny Halevy
d8be1cafb5 test_tablets: add rack decommission test cases
Test scenarios where decommissioing a compelte rack
should succeed, and reproduce scylladb/scylladb#19475
where decommissioning a rack would fail since the
number of remaining racks is insufficient to satisfy
the replication factor, even though the number of nodes
is sufficient, enshrining this behavior.

Refs scylladb/scylladb#19475

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:54:13 +02:00
Avi Kivity
b706e3e9e4 Merge 'sstables/index_reader: avoid unnecessary index page reads in single-partition reads' from Michał Chojnowski
Terminology note: in the context of this series, "index page" means an contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages.

---

In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key.

Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that *all* search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be.

In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes *additional* I/O to happen *past* the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS — throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.

This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page.

Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads:

```
                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38391808, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

After the patch, the lookup of each bound involves 1 read:
```

                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever.

Closes scylladb/scylladb#20897

* github.com:scylladb/scylladb:
  index_reader: remove a piece of misguided code involved in single-partition reads
  index_reader: in single-partition reads, don't read more than one page
  index_reader: fix unnecessary reads of preceding index pages
2024-11-04 14:28:27 +02:00
Avi Kivity
2531dc2d80 schema_registry: stop including replica/database.hh
database.hh is a hotspot that changes often (or its dependencies
do). Avoid including it to reduce recompilations.

Closes scylladb/scylladb#21407
2024-11-04 13:16:27 +01:00
Benny Halevy
9ff614da9f topology_experimental_raft/test_tablets: get_tablet_count_per_shard_for_host: move shards_count param to be last
Prepare for the next comit that will add a version
accepting a list of servers: `get_tablet_count_per_shard_for_hosts`
for which we want `shards_per_node` to be last and have a default value.

Also, fix the type hint for `full_tables`, as it had a syntax error,
using `:` instead of `,`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:11:30 +02:00
Benny Halevy
0c1e85b6e3 test/pylib: ServerInfo: add datacenter and rack attributes
Set to "DEFAULT_DC" and "DEFAULT_RACK" by default.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:11:30 +02:00
Benny Halevy
efa64cb92a test: everywhere: drop unused imports of ServerInfo
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:11:30 +02:00
Avi Kivity
7cb1ad8c87 Merge 'compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors' from Benny Halevy
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.

stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.

Fixes scylladb/scylladb#21159

* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body

Closes scylladb/scylladb#21299

* github.com:scylladb/scylladb:
  compaction_manager: stop: await _stop_future if engaged
  compaction_manager: really_do_stop:  assert that no tasks are left behind
  compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
  compaction/compaction_manager: stop_tasks(): unlink stopped tasks
  compaction/compaction_manager: make _tasks an intrusive list
2024-11-04 13:54:16 +02:00
Gleb Natapov
3b7d9fddbc gossiper: drop unneeded this-> 2024-11-04 12:02:51 +02:00
Gleb Natapov
501b8f6984 gossiper: fold get_or_create_endpoint_state into my_endpoint_state
my_endpoint_state() is the only called of
get_or_create_endpoint_state() and calling it is the only thing the
function does anyway.
2024-11-04 12:02:51 +02:00
Gleb Natapov
300cbcebf6 gossiper: co-routinize do_send_ack_msg 2024-11-04 12:02:51 +02:00
Avi Kivity
d4b0a03d4c locator: token_metadata: switch from boost ranges to std ranges
Since drop_front() does not exist, replace it with advance(1).

Reduce dependency load.
2024-11-03 20:45:29 +02:00
Avi Kivity
247d92fbe5 locator: token_metadata: make iterator support std::input_iterator concept
Add the iterator_concept tag, and make it post-incrementable to
conform to the concept. This prepares the iterator for std::ranges.
2024-11-03 20:39:39 +02:00
Avi Kivity
b93c2c70f9 locator: tokens_metadata: move tokens_iterator to namespace scope
It's difficult to use nested classes with C++ concepts, since the class
might not be fully defined at the point the concept is evaluated,
resulting in spurious errors (e.g. thinking tokens_iterator is not
default constructible). Move it to namespace scope to reduce pain.
2024-11-03 20:39:31 +02:00
Pavel Emelyanov
f3f956841f sstables: Remove unused mp_row_consumer_m::range_tombstone_start
It's only used by its operator<< so remove it as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21380
2024-11-03 16:40:02 +02:00
Avi Kivity
704ea9d3b4 Merge 'api: Remove foreach_column_family() helper' from Pavel Emelyanov
There's a whole lot of helpers and wrappers in api/ that help handlers manipulate keyspaces and tables. One of those is foreach_column_family which calls the provided callable on a table on each shard. There's exactly the same (but a bit more flexible) helper nearby. While at it, this helper gets a better name.

Closes scylladb/scylladb#21398

* github.com:scylladb/scylladb:
  api: Rename set_tables -> for_tables_on_all_shards
  api: Remove foreach_column_family() helper
2024-11-03 15:46:27 +02:00
Avi Kivity
856489ded1 cql3: remove unused request_validations methods
These methods are not used and therefore removed.

Closes scylladb/scylladb#21392
2024-11-03 13:17:32 +02:00
Benny Halevy
6cce67bec8 compaction_manager: stop: await _stop_future if engaged
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental.  What matters is that `_stop_future` is engaged.

While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:53:35 +02:00
Benny Halevy
a7a55298ea compaction_manager: really_do_stop: assert that no tasks are left behind
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:53:34 +02:00
Benny Halevy
c08ba8af68 compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.

Fixes scylladb/scylladb#21298

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-03 10:52:58 +02:00
Botond Dénes
d8500472b3 compaction/compaction_manager: stop_tasks(): unlink stopped tasks
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).
2024-11-03 10:17:11 +02:00
Botond Dénes
e942c074f2 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
2024-11-03 10:17:11 +02:00
Avi Kivity
39b55bd3a0 Update seastar submodule
* seastar f821bda19...fba36a3d1 (13):
  > build: do not include -DBoost_TEST_DYN_LINK in seastar_testing_cflags
  > doc: compatibility: update the notes on supported GCC versions
  > docker: bump up to clang {18,19} and gcc {13,14}
  > rpc: optimize small tuple deserialization
  > rpc: switch rpc::type from boost to std
  > thread: do not use fortify source
  > build: suppress CMake warning about CMP0057
  > core/units: remove space before literal identifier
  > signal.md: describe auto signal handling
  > build: persist Seastar options in SeastarConfig.cmake
  > sharded.hh: seperate invoke_on decls from defs
  > test: Add perf test for http client
  > gate: check: mark as const

Closes scylladb/scylladb#21390
2024-11-02 13:58:45 +02:00
Botond Dénes
19a43b5859 Merge 'repair: Reduce hints and batchlog flush' from Asias He
The hints and batchlog flush requests are issued to all nodes for each repair request when tombstone_gc repair mode is used.

The amount of such flush requests is high when all nodes in the cluster run repair. It is observed it takes a long time, up to 15s, for a repair request to finish such a flush request.

To reduce overhead of the flush, each node caches the flush and only executes the real flush when some time has passed. It is safe to do so before the real flush_time is returned. Repair uses the smallest flush_time from peers as the repair time.

The nice thing about the cache on the receiver side is that all senders can hit the cache. It is better than cache on the sender side.

A slightly smaller flush_time compared to the real flush time will be used with the benefits of significantly dropped hints and batchlog flush. The tradeoff is reasonable.

Fixes #20259

Performance improvement. No backports.

Closes scylladb/scylladb#20260

* github.com:scylladb/scylladb:
  test/test_repair.py: Add test_batchlog_flush_in_repair
  repair: Reduce hints and batchlog flush
  db/batchlog_manager: Add add_delay_to_batch_replay
  db/batchlog_manager: Add get_last_replay
  db/batchlog_manager: wire in batchlog_replay_cleanup_after_replays
  db/config: introduce batchlog_replay_cleanup_after_replays
  db/batchlog_manager: do_batch_log_replay(): add cleanup flag
2024-11-01 14:23:27 +02:00
Pavel Emelyanov
292fd52a60 Merge 'utils: chunked_vector: various constructor improvements' from Avi Kivity
Optimize the various constructors a little, and add an std::from_range_t
constructor.

Minor improvement, so no backports.

Closes scylladb/scylladb#21399

* github.com:scylladb/scylladb:
  utils: chunked_vector: add from_range_t constructor
  utils: chunked_vector: optimize initializer_list constructor
  utils: chunked_vector: iterator constructor: copy spanwise
  utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
2024-11-01 15:02:56 +03:00
Botond Dénes
4bafaee523 Merge 'tasks: improve task_manager::lookup_virtual_task' from Aleksandra Martyniuk
Currently, to find the operation with given id, all operations tracked by a virtual task are listed. This isn't necessary, since we only need info regarding one particular operation.

Add a method to check whether a virtual task tracks the operation with the given id.

No backport needed

Closes scylladb/scylladb#20769

* github.com:scylladb/scylladb:
  tasks: delete virtual_task::get_ids method as it is unused
  tasks: improve task_manager::lookup_virtual_task
2024-11-01 13:44:04 +02:00
Kefu Chai
1b8446f92d compaction: fix the indent
in 38ce2c605d, we left a TODO for
reindent the code.

in this change, we reindent the code to address this TODO.

Refs 38ce2c605d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21383
2024-11-01 12:55:47 +03:00
Avi Kivity
b5e46077df sstables: generation_type: replace boost ranges with std ranges
Reduce dependency load.

Closes scylladb/scylladb#21402
2024-11-01 12:45:24 +03:00
Pavel Emelyanov
d6169630a4 api: Rename set_tables -> for_tables_on_all_shards
The former name is not extremely descriptive, hopefully the latter one
is better in this sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-11-01 12:15:01 +03:00
Pavel Emelyanov
822758dffd api: Remove foreach_column_family() helper
There's a whole lot of helpers and wrappers in api/ that help handlers
manipulate keyspaces and tables. One of those is foreach_column_family
which calls the provided callable on a table on each shard. There's
exactly the same (but a bit more flexible) set_table() helper nearby.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-11-01 12:13:35 +03:00
Botond Dénes
0ee0dd3ef4 Merge 'Collect and report backup progress' from Pavel Emelyanov
Task manager GET /status method returns two counters that reflect task progress -- total and completed. To make caller reason about their meaning, additionally there's progress_units field next to those counters.

This patch implements this progress report for backup task. The units are bytes, the total counter is total size of files that are being uploaded, and the completed counter is total amount of bytes successfully sent with PUT requests. To get the counters, the client::upload_file() is extended to calculate those.

fixes #20653

Closes scylladb/scylladb#21144

* github.com:scylladb/scylladb:
  backup_task: Report uploading progress
  s3/client: Account upload progress for real
  s3/client: Introduce upload_progress
  s3: Extract client_fwd.hh
2024-11-01 10:57:12 +02:00
Kefu Chai
64122b3df3 treewide: s/boost::transform/std::ranges::transform/
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::transform`.

in this change, we:

- replace `boost::transform` with `std::ranges::transform`
- update affected code to work with `std::ranges::transform`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21318
2024-11-01 08:15:14 +02:00
Avi Kivity
8c67f9b42e cql3: util: remove unneeded boost/range includes from header files
The includes are redistributed to the source files that need them.

Closes scylladb/scylladb#21391
2024-10-31 23:49:44 +01:00
Nadav Har'El
ee2d75b088 Merge 'Generalize "breakpoint" type of error injection' from Pavel Emelyanov
This pattern is -- if requested (by test) suspend code execution until requestor (the test) explicitly wakes it up. For that the injected place should inject a lambda that is called with so called "handler" at hand and try to read message from the handler. In many cases the inner lambda additionally prints a message into logs that tests waits upon to make sure injection was stepped on. In the end of the day this "breakpoint" is injected like

```
    co_await inject("foo", [] (auto& handler) {
        log.info("foo waiting");
        co_await handler.wait_for_message(timeout);
    });
```

This PR makes breakpoints shorter and more unified, like this

```
    co_await inject("foo", wait_for_message(timeout));
```

where `wait_for_message` is a wrapper structure used to pick new `inject()` overload.

Closes scylladb/scylladb#21342

* github.com:scylladb/scylladb:
  sstables: Use inject(wait_for_message_overload)
  treewide,error_injection: Use inject(wait_for_message) and fix tests
  treewide,error_injection: Use inject(wait_for_message) overload
  error_injection: Add inject() overload with wait_for_message wrapper
2024-10-31 21:56:27 +02:00
Avi Kivity
6a9852d47b utils: chunked_vector: add from_range_t constructor
std::ranges::to<> has a little protocol with containers. Implement it
to get optimized construction.

Similar to the iterator pair constructor, if the range's size can be
obtained (even with an O(N) algorithm), favor that to avoid reallocations.
Copy elements spanwise to promote optimization to memcpy when possible.
2024-10-31 19:32:16 +02:00
Avi Kivity
b2769403d2 utils: chunked_vector: optimize initializer_list constructor
Delegate to the previously optimized iterator-pair constructor.
2024-10-31 18:10:14 +02:00
Avi Kivity
0a81be4321 utils: chunked_vector: iterator constructor: copy spanwise
Instead of copying element-by-element, copy contiguous spans. This
is much faster if the input is a span and the constructor is trivial,
since the whole thing translates to a memcpy.

Make the two branches constexpr to reduce work for the compiler in
optimizing the other branch away.
2024-10-31 18:10:08 +02:00
Avi Kivity
4653430c8e utils: chunked_vector: reserve for forward iterators, not just random access iterators, on construction
For a forward iterator, prefer a two pass algorithm to first count
the number of elements, reserver, then copy the elements, to a single
pass algorithm that involves reallocation and copying.
2024-10-31 17:55:42 +02:00
Kefu Chai
673b107ffa github: use GithubException when appropriate
`Exception` could be too general, what we really care about is
`GithubException`. so let's catch the latter instead for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21364
2024-10-31 18:21:29 +03:00
Kefu Chai
f8221b960f test: route S3 mock server messages through logger
The S3 mock server (introduced in 5a96549c) currently prints its status
messages directly to stdout, which can be distracting when reviewing test
results. For example:

```console
$ ./test.py --verbose --mode debug object_store/test_backup::test_simple_backup
Found 1 tests.
Starting S3 mock server on ('127.226.51.1', 2012)
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/1]      object_store  debug  [ PASS ] object_store.test_backup.1 5.99s
Stopping S3 mock server
-------------------------
CPU utilization: 6.5%
```

Move these messages to use proper logging to give developers more control
over their visibility:

- Make logger parameter mandatory in MockS3Server constructor
- Route "Stopping S3 mock server" message through the provided logger
- Add --log-level option to the standalone mock server launcher

The message is now hidden:

```console
$ ./test.py --verbose --mode debug --save-log-on-success object_store/test_backup::test_simple_backup
Found 1 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------

[1/1]      object_store  debug  [ PASS ] object_store.test_backup.1 6.25s
------------------------------------------------------------------------------
CPU utilization: 5.5%
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21384
2024-10-31 18:21:29 +03:00
Benny Halevy
78ceaeabca compaction_manager: compaction_disabled: return true if not in compaction_state
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.

This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.

Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.

Fixes scylladb/scylladb#20060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21378
2024-10-31 18:21:29 +03:00
Dawid Mędrek
495c1188e9 docs/dev: Document semantics of describing CDC tables 2024-10-31 11:25:19 +01:00
Dawid Mędrek
39e0513e1b cql3: Allow for describing CDC log tables
In the past, DESC SCHEMA would produce create statements for both the base
and the log table. That was incorrect as the log table is automatically
created alongside the base one. That was solved in scylladb/scylladb@9ab57b1
(scylladb/scylladb#18467).

The mentioned changes implemented the following solution:

* DESC SCHEMA/KEYSPACE/TABLE would still print a create statement for the
  CDC base table,
* DESC SCHEMA/KEYSPACE would start printing an alter statement for the
  CDC log table. That statement would ensure that the restored log table
  has the same parameters as the original one,
* DESC TABLE <base table> would behave as DESC SCHEMA/KEYSPACE, i.e.
  it would print a create statement for the base table and an alter
  statement for the log table,
* DESC TABLE <log table> would result in an error.

While that solution was good and behaved correctly in the context of
restoring the schema, it had one flaw: describe statement aren't only
used as a means for producing a backup; they also serve an informative
purpose to learn about the schema, e.g. to learn what parameters a specific
table uses. Because we didn't allow for describing CDC log tables, the user
couldn't look them up directly via a describe statement -- they had to
describe the base table for that.

Attempting to describe a log table ended with an error, e.g.:

```
$ DESC TABLE ks.t_scylla_cdc_log;

ks.t_scylla_cdc_log is a cdc log table and it cannot be described directly. Try `DESC TABLE ks.t` to describe cdc base table and it's log table.
```

In these changes, we allow for describing CDC log tables again. The
semantics of the first three bullets above remains unchanged, but
we impose new behavior for DESC TABLE <log table>:

* When the user executes DESC TABLE <log table>, a create statement
  will be returned, treating the table as if it were a regular one,
* The create statement will be wrapped in CQL comment markers.

The rationale for the second bullet is that although we want to give the
user a means to look into the structure and options of a CDC log table,
the returned statement is not supposed to be ever executed by them. We
want to minimize the risk of that.

An example of the behavior after the change:

```
$ DESC TABLE ks.t_scylla_cdc_log;

/* Do NOT execute this statement! It's only for informational purposes.
   A CDC log table is created automatically when the base is created.

CREATE TABLE ks.t_scylla_cdc_log (
    "cdc$stream_id" blob,
    "cdc$time" timeuuid,
    "cdc$batch_seq_no" int,
    "cdc$end_of_batch" boolean,
    "cdc$operation" tinyint,
    "cdc$ttl" bigint,
    p int,
    PRIMARY KEY ("cdc$stream_id", "cdc$time", "cdc$batch_seq_no")
) WITH CLUSTERING ORDER BY ("cdc$time" ASC, "cdc$batch_seq_no" ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'enabled': 'false', 'keys': 'NONE', 'rows_per_partition': 'NONE'}
    AND comment = 'CDC log for ks.t'
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '60', 'compaction_window_unit': 'MINUTES', 'expired_sstable_check_frequency_seconds': '1800'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

*/
```

Fixes scylladb/scylladb#21235
2024-10-31 11:25:19 +01:00
Wojciech Mitros
88ab8db944 mv: run view building in streaming scheduling group
View building is an expensive process that takes a long time to complete.
During the build, it's impact on other work should be minimized, even at
the expense of slightly slowing it down.

Instead, view building is currently performed in the the same scheduling
group (gossip) as other high-priority tasks, in particular raft processing,
which slows it down, making races more likely and increasing the number
of retries that need to be done.

While view building is still initiated in the gossip group (as it's the
result of adding a view, which is a schema change), in this patch the bulk
of the view building work is moved to a low-priority, maintenance scheduling
group (named "streaming" after its main use case).

Additionally, a test is added, where we make sure that the scheduling
group is the one most used when building a view.

Fixes https://github.com/scylladb/scylladb/issues/21232

Closes scylladb/scylladb#21326
2024-10-31 10:13:20 +01:00
Nadav Har'El
7572c483b1 test/topology_experimental_raft: fix flaky test
Today, each test function in test/topology_experimental_raft creates a
cluster in the beginning of the test and drops it at the end of the
function. This is very inefficient if you hope (like I do) to write many
small and pinpointed test functions instead of large test functions that
test 20 unrelated things.

Trying to propose a way to change this sad state of affairs, in
test_alternator.py I created a fixture "alternator3" which I hoped could
be used in multiple tests that need a 3-node Alternator cluster.
Currently only one test uses this fixture.

Unfortunately, it turns out the alternator3 fixture is broken, and
led to flaky test runs (sometimes the test using alternator3 picked
up an existing cluster instead of starting with an empty cluster,
and failed). These problems cannot be *completely* fixed at the current
state of the framework. The framework does not currently allow keeping
a 3-node cluster between test functions, while also allowing other test
functions to create different clusters. The specific flakiness we saw
could be fixed by adding a missing before_test() call, but in the
future we would need to ensure that all the test functions that
use it are contiguous in the test file, and I don't see how we can (or
want to) ensure this. So at this point I am giving up and withdrawing
this proposal until the developers of the topology test framework
make this one of their design goals.

Since there was only one test using this fixture, removing it should
make no performance or correctness difference - it should just fix
the flakiness.

Fixes scylladb/scylladb#21322.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21370
2024-10-31 10:12:26 +01:00
Calle Wilund
c4361037f7 cql_test_env/gossip: Prevent double shutdown call crash
Fixes scylladb/scylladb#21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

Closes scylladb/scylladb#21379
2024-10-31 10:11:20 +01:00
Nadav Har'El
d3f09638f0 Merge 'compound_compat: replace use of boost ranges with std ranges' from Avi Kivity
Replace use of boost::ranges::join() with another construct, as it
has no std replacement, and replace other uses with their std
equivalent, in order to reduce dependency load.

Code cleanup - no backport.

Closes scylladb/scylladb#21382

* github.com:scylladb/scylladb:
  compound_compat: replace use of boost ranges with std ranges
  compound_compat: simplify seriakization of ka/la sstables static cell names
2024-10-31 10:16:41 +02:00
Nadav Har'El
65e29f28bd Merge 'gms: remove unused #includes ' from Kefu Chai
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been
confirmed.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21374

* github.com:scylladb/scylladb:
  .github: add gms to iwyu's CLEANER_DIR
  gms: remove unused `#include`s
2024-10-31 09:06:37 +02:00
Kefu Chai
2498e37a2f mutation_writer,streaming: use reader_consumer_v2 type when appropriate
The `reader_consumer_v2` type
(`std::function<future<> (mutation_reader)>`) is defined alongside
`mutation_reader` in `mutation_reader.hh`.

before this change, we sometimes use
`std::function<future<> (mutation_reader)>` directly when defining a
consumer parameter or a consumer variable.

in this change, we improve maintainability by:

- Reducing duplicate function type declarations
- Centralizing the consumer type definition
- Making future signature updates easier to implement

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21369
2024-10-31 07:17:47 +02:00
Avi Kivity
907da210b6 compound_compat: replace use of boost ranges with std ranges
To reduce the dependency load, replace use of boost ranges
with the std equivalent.

Files that lost the indirect boost dependency have it added as a
direct dependency.
2024-10-30 19:58:07 +02:00
Avi Kivity
982cebc1f6 compound_compat: simplify seriakization of ka/la sstables static cell names
compound_compat is used for serializing ka/la sstables static cell names.
Since we can no longer write such sstabkes, the function is used only
in some tests.

Reduce the use of boost::range::join(): it has no direct equivalent
in std (std::views::concat is in C++26), and it is slow due to the
need to type-erase. Instead of using boost::range::join, extend the
vector used to hold the empty clustering key a bit more, and copy
the view representing the static cell name into into it.
2024-10-30 19:19:57 +02:00
Kefu Chai
d3a6931b14 .github: add gms to iwyu's CLEANER_DIR
to avoid future violations of include-what-you-use.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-30 23:01:34 +08:00
Kefu Chai
52ec315ffd gms: remove unused #includes
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-30 23:01:34 +08:00
Pavel Emelyanov
c16369323b sstables: Use inject(wait_for_message_overload)
This place could be in the pre-previous patch, it just can use the
overload, but it seemengly has a bug. It prints _two_ messages -- that
the injection handler was suspended and that it was woken up. The bug is
in the 2nd message -- it's printed without waiting for the message, so
it likely gets printed before wakeup itself. It seems that no tests care
about it though.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Pavel Emelyanov
39cb93be3c treewide,error_injection: Use inject(wait_for_message) and fix tests
This is continuation of previous patch, this time also update tests that
wait for specific message in logs (to make sure injection handler was
called and paused the code execution).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Pavel Emelyanov
7d8cc3ccc2 treewide,error_injection: Use inject(wait_for_message) overload
Many places want to inject a handler that waits for external kick. Now
there's convenience inject() method overload for this. It will result in
extra messages in logs, but so far no code/test cares about it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Pavel Emelyanov
c1432f3657 error_injection: Add inject() overload with wait_for_message wrapper
The wrapper object denotes that injection should run a handler and
wait_for_message() on it. Wrapper carries the timeout used to call the
mentioned method. It's currently unused, next patches will start enjoing
it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Dawid Mędrek
b984488552 cql3: Rename SALTED HASH to HASHED PASSWORD
Cassandra 4.1 announced a new option to create a role with:
`HASHED PASSWORD`. Example:

```
CREATE ROLE bob WITH HASHED PASSWORD = 'hashed_password';
```

We've already introduced another option following the same
semantics: `SALTED HASH`; example:

```
CREATE ROLE bob WITH SALTED HASH = 'salted_hash';
```

The change hasn't made it to any release yet, so in this commit
we rename it to `HASHED PASSWORD` to be compatible with Cassandra.

Additionally, we adjust existing tests to work against Cassandra too.

Fixes scylladb/scylladb#21350

Closes scylladb/scylladb#21352
2024-10-30 14:07:58 +02:00
Aleksandra Martyniuk
bc5b1f9a5d tasks: delete virtual_task::get_ids method as it is unused 2024-10-30 12:25:47 +01:00
Aleksandra Martyniuk
9b5d69ae96 tasks: improve task_manager::lookup_virtual_task
Currently, lookup_virtual_task gets the list of ids of all operations
tracked by a virtual task and checks whether it contains given id.
The list of all ids isn't required and the check whether one particular
operation id is tracked by the virtual task may be quicker than listing
all operations.

Add virtual_task::contains method and use it in lookup_virtual_task.
2024-10-30 12:24:38 +01:00
Kefu Chai
d81ed5adb4 compaction: explain make_interpose_consumer() in compaction strategy
Add documentation to clarify the purpose and behavior of
make_interpose_consumer() in the compaction_strategy_impl class. This
method is crucial for building layered processing pipelines but its
semantics were previously undocumented.

The added documentation explains how:
- It decorates end consumers with additional processing steps
- It enables construction of processing pipelines
- The original consumer's semantics are preserved

This improves code maintainability by making the pipeline construction
pattern more apparent to developers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21336
2024-10-30 13:22:00 +03:00
Tomasz Grabiec
f3869dadc6 Merge 'compound: replace boost ranges with std ranges' from Avi Kivity
Continue standardization on std::ranges. Since compound contains a custom
iterator, we first have to upgrade it to C++20 iterator concepts.

Cleanup / minor refactoring, so no backport.

Closes scylladb/scylladb#21320

* github.com:scylladb/scylladb:
  compound: replace boost ranges with std ranges
  compound: upgrade iterator to be an std::forward_iterator
2024-10-30 11:02:51 +01:00
Asias He
73806f66a5 test/test_repair.py: Add test_batchlog_flush_in_repair
It checks batchlog flush request cache in repair.
2024-10-30 11:10:39 +08:00
Asias He
b3b3e880d3 repair: Reduce hints and batchlog flush
The hints and batchlog flush requests are issued to all nodes for each
repair request when tombstone_gc repair mode is used.

The amount of such flush requests is high when all nodes in the cluster
run repair. It is observed it takes a long time, up to 15s, for a repair
request to finish such a flush request.

To reduce overhead of the flush, each node caches the flush and only
executes the real flush when the cahce time has passed. It is safe to do
so because the real flush_time is returned. Repair uses the smallest
flush_time returned from peers as the repair time.

The nice thing about the cache on the receiver side is that all senders
can hit the cache. It is better than cache on the sender side.

A slightly smaller flush_time compared to the real flush time will be
used with the benefits of significantly dropped hints and batchlog
flush. The trade-off looks reasonable.

Tests: 2 nodes, with 1s batchlog delay:

Before:
   Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198

After:
   Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288

Fixes #20259
2024-10-30 11:07:57 +08:00
Asias He
f8ad78ba1e db/batchlog_manager: Add add_delay_to_batch_replay
It is used to simulate slow replay.
2024-10-30 11:07:57 +08:00
Asias He
fed9b54664 db/batchlog_manager: Add get_last_replay
It is used to get the time when the last replay is executed.
2024-10-30 11:07:57 +08:00
Botond Dénes
3361542e84 db/batchlog_manager: wire in batchlog_replay_cleanup_after_replays
After the specified amount of replays, trigger a cleanup: flush batchlog
table memtables. This allows the cleanup to happen on a configurable
interval, instead of on every batchlog replay attempt, which might be
too much.
2024-10-30 11:07:57 +08:00
Botond Dénes
1635525526 db/config: introduce batchlog_replay_cleanup_after_replays
Not used yet.
2024-10-30 11:07:57 +08:00
Botond Dénes
169c74346d db/batchlog_manager: do_batch_log_replay(): add cleanup flag
Add a flag controlling whether cleanup (memtable flush) will be done
after the replay. This is to allow repair to opt out from cleanup --
when many concurrenty repairs are running, there can be storms of calles
to do_batch_log_replay(), which will be mostly no-op, but they will all
attempt to flush the memtable to clean-up after themselves. This is
unnecessary and introduces latency to repairs, best to leave the cleanup
to the periodic batch-log replay.
2024-10-30 11:07:57 +08:00
Avi Kivity
73b1f66b70 Revert "Merge 'Allow explicitly enabling or disabling tablets when creating a new keyspace' from Benny Halevy"
This reverts commit c286434e4c, reversing
changes made to 6712fcc316.

The commit causes memtable_test to be very flaky in debug mode.
Specifically, subtests test_exceptions_in_flush_on_sstable_open
and test_exceptions_in_flush_on_sstable_write).
2024-10-30 00:55:29 +02:00
Avi Kivity
b9df3aec12 gdb: avoid @classmethod/@property combinations
The @classmethod/@property combination was deprecated in Python 3.11
and removed[1] in Python 3.13. It's used in scylla-gdb.py, breaking it
with Python 3.13.

To fix, just make all users (size_t and _vptr_type) top-level
functions. The definitions are all identical and don't need to be
in class scope.

[1] https://docs.python.org/3.13/library/functions.html#classmethod

Closes scylladb/scylladb#21349
2024-10-29 19:37:07 +02:00
Gleb Natapov
cc7f25062a topology coordinator: take a copy of a replication state in raft_topology_cmd_handler
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.

Fix it by taking a copy instead. This is a slow path anyway.

Fixes: scylladb/scylladb#21220

Closes scylladb/scylladb#21316
2024-10-29 15:47:43 +01:00
Avi Kivity
020ccbd76a Merge 'utils: cached_file: Mark permit as awaiting on page miss' from Tomasz Grabiec
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).

Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.

Fixes #21325

Closes scylladb/scylladb#21323

* github.com:scylladb/scylladb:
  utils: cached_file: Mark permit as awaiting on page miss
  utils: cached_file: Push resource_unit management down to cached_file
2024-10-29 16:15:21 +02:00
Kamil Braun
36cc3bcc90 test: test_crash_coordinator_before_streaming: enable TRACE for raft_topology logger
Issue scylladb/scylladb#21114 reported that sometimes during the test we
timeout when waiting for node to restart after it was killed.
Preliminary investigation showed that the node appears to be hanging
inside `topology_state_load`, while holding `token_metadata` lock, which
prevents `join_topology` from progressing.

Enable TRACE level logging for `raft_topology` so we get more accurate
info where inside `topology_state_load` the hang happens, once the
problem reproduces again in CI.

Closes scylladb/scylladb#21247
2024-10-29 12:46:47 +02:00
Kefu Chai
54d438168a build: cmake: explicitly mark convenience libraries as STATIC
before this change, these
[convenience libraries](https://www.gnu.org/software/automake/manual/html_node/Libtool-Convenience-Libraries.html)
were implicitly built as static libraries by default,
but weren't explicitly marked as STATIC in CMake. While this worked
with default settings, it could cause issues if `BUILD_SHARED_LIBS` is
enabled.

So before we are ready for building these components as shared
libraries, let's mark all convenience libraries as STATIC for
consistency and to prevent potential issues before we properly support
shared library builds.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21274
2024-10-29 10:22:19 +01:00
Yaron Kaikov
94a9efbf1c github: add script for backports automation instead of Mergify
Adding an auto-backport.py script to handle backport automation instead of Mergify.

The rules of backport are as follows:

* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)

Fixes: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21302
2024-10-29 10:04:30 +02:00
Pavel Emelyanov
25ae3d0aed backup_task: Report uploading progress
Do it by passing reference to s3::upload_progress_monitor object that
sits on task impl itself. Different files' uploads would then update the
monitor with their sizes and uploaded counters. The structure is
reported by get_progress() method. Unit size is set to be bytes. Test is
updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-29 08:40:35 +03:00
Pavel Emelyanov
2efcfc13e8 s3/client: Account upload progress for real
Before upload starts file size is checked, so this is the place that
updates progress.total counter.

Uploading a file happens by reading unit_size bytes from file input
stream and writing the buffer into http body writer stream. This is the
place to update progress.uploaded counter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-29 08:38:39 +03:00
Pavel Emelyanov
51e03b1025 s3/client: Introduce upload_progress
This is a structure with "total" and "uploaded" counters that's passed
by user to client::upload_file() method so that client would update it
with the progress.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-29 08:38:39 +03:00
Pavel Emelyanov
f9a5e02b53 s3: Extract client_fwd.hh
This is to export some simple structures to users without the need to
include client.hh itself (rather large already)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-29 08:38:39 +03:00
Avi Kivity
49d3e281d6 Merge 'Sanitize /system/highest_supported_sstable_version API endpoint' from Pavel Emelyanov
Its handler dereferences long chain of objects to get to the value it needs. There's shorter way.
Also, the endpoint in question is not unregistered on stop.

Closes scylladb/scylladb#21279

* github.com:scylladb/scylladb:
  api: Make get_highest_supported_sstable_version use proper service
  api: Move system::get_highest_supported_sstable_version set/unset
  api: Scaffold for sstables-format-selector
2024-10-28 21:42:41 +02:00
Pavel Emelyanov
b09bb6bc19 error_injection: Re-use enter() code in inject() overloads
Most of inject() overloads check if the injection is enabled, then
optionally clear the one-shot one, then do the injection. Everything
but doing the injection is implemented in the enter() method, it's
perfectly worth re-using one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21285
2024-10-28 21:37:20 +02:00
Kefu Chai
7610b907c6 build: include subdirectory rules in compilation database merge
Previously in e65185ba, when merging Seastar's and ScyllaDB's compilation
databases, the "prefix" parameter in merge-compdb.py was too restrictive.
It only included build rules for files with "CMakeFiles" prefix, excluding
source files in subdirectories like `apps/iotune/CMakeFiles/app_iotune.dir/iotune.cc.o`.

In this change, we change the prefix parameter to an empty string to
include all source files whose object files are located under build directories, regardless of their path structure.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21312
2024-10-28 21:34:21 +02:00
Avi Kivity
c286434e4c Merge 'Allow explicitly enabling or disabling tablets when creating a new keyspace' from Benny Halevy
Separate the configuration for enabling the tablets feature from the enablement of tablets when creating new keyspaces.

This change always enables the TABLETS cluster feature and the tablets logic respectively.

The `enable_tablets` config option just controls whether tablets are enabled or disabled by default for new keyspaces.

If `enable_tablets` is set to `true`, tablets can be disabled using `CREATE KEYSPACE WITH tablets = { 'enabled': false }` as it is today.

If `enable_tablets` is set to `false`, tablets can be enabled using `CREATE KEYSPACE WITH tablets = { 'enabled': true }`.

The motivation for this change is to simplify the user experience of using tablets by setting the default for new keyspaces to false amd allowing the user to simply opt-in by using tablets = {enabled: true }.
This is not pissible today.
The user has to enable tablets by default for all new keyspaces (that use the NetworkTopologyStrategy) and then actively opt-out to use vnodes.

* Not required to be backported to OSS versions.  May be backported to specific enterprise versions

Closes scylladb/scylladb#20729

* github.com:scylladb/scylladb:
  data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
  tablets_test: test enable/disable tablets when creating a new keyspace
  treewide: always allow tablets keyspaces
  feature_service: prevent enabling both tablets and gossip topology changes
  alternator: create_keyspace_metadata: enable tablets using feature_service
2024-10-28 21:33:17 +02:00
Nadav Har'El
6712fcc316 test/cql-pytest: add option to run cql-pytes tests against specific release
This patch adds the option "--release <version>" to test/cql-pytest/run,
which downloads the pre-compiled Scylla release with the given version
number and runs the tests against that version. For example, it can be used
to demonstrate that #15559 was indeed a regression between 2022.1 and 2022.2,
by running a recently-added test against these two old versions:

test/cql-pytest/run --release 2022.1 --runxfail \
        test_prepare.py::test_duplicate_named_bind_marker_prepared

test/cql-pytest/run --release 2022.2 --runxfail \
        test_prepare.py::test_duplicate_named_bind_marker_prepared

The first run passes, the second fails - showing the regression.

The Scylla releases are downloaded from ScyllaDB's S3 bucket
(downloads.scylladb.com). They are saved in the build/ directory
(e.g., build/2022.2.9), and if that directory is not removed, when
"run --release" requests the same version again, the previous download
is reused.

Release numbers can look like:

    * 5.4.7
    * 5.4 (will get the latest in the 5.4 branch, e.g., 5.4.7)
    * 5.4.0~rc2 (a prerelease)
    * 2021.1.9 (Enterprise release)
    * 2023.1 (latest in this branch, Enterprise release)

Fixes #13189

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19228
2024-10-28 21:29:44 +02:00
Kefu Chai
f3dee5b636 build: enable CMAKE_CXX_EXTENSIONS explicitly
before this change, Seastar enables CXX_EXTENSIONS in its own
build rules. but it does not expose it to the parent project. but
scylladb's CMake building system respect seastar's .pc file and
includes the cflags exposed by it. without this change, scylladb
included "-std=c++23" from seastar, and "-std=gnu++23" from itself.
this is both confusing and inconsistent with the build rules generated
by `configure.py`.

in this change, we explicitly set `CMAKE_CXX_EXTENSIONS` when creating
Seastar's building rules, so that it can populate this setting to its
.pc file. in this way, we don't have two different options for
specifying the C++ standard when building scylladb with CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21311
2024-10-28 21:23:04 +02:00
Kefu Chai
8b80ef3290 build: Remove GCC ARM warning workaround (originally added in 193d1942)
The workaround was initially added to silence warnings on GCC < 6.4 for ARM
platforms due to a compiler bug (gcc.gnu.org/bugzilla/show_bug.cgi?id=77728).
Since our codebase now requires modern GCC versions for coroutine support,
and the bug was fixed in GCC 6.4+, this workaround is no longer needed.

Refs 193d1942f2

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21308
2024-10-28 21:19:56 +02:00
Avi Kivity
94c21e5c05 Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%
```

After:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%
```

Backports: none, not a regression

Closes scylladb/scylladb#20522

* github.com:scylladb/scylladb:
  perf: perf_fast_forward: Add test case for querying missing rows
  perf-fast-forward: Allow overriding promoted index block size
  perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
  perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
  perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
  sstables: bsearch_clustered_cursor: Add more tracing points
  sstables: reader: Log data file range
  sstables: bsearch_clustered_cursor: Unify skip_info logging
  sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
  sstables: bsearch_clustered_cursor: Skip even to the first block
  test: sstables: sstable_3_x_test: Improve failure message
  sstables: mx: writer: Never include partition_end marker in promoted index block width
  sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
  sstables: clustered_cursor: Track current block
2024-10-28 21:13:23 +02:00
Tomasz Grabiec
0f2101b055 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.
2024-10-28 19:54:58 +01:00
Tomasz Grabiec
868f5b59c4 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.
2024-10-28 19:49:58 +01:00
Avi Kivity
d3dae09316 compound: replace boost ranges with std ranges
Standardize on the standard range library.

The serialize_value(initializer_list) overload is disambiguated
not to call itself. Apparently it wasn't called before.

Since std::ranges::subrange does not provide operator==, replace
it with std::ranges::equals().
2024-10-28 18:35:41 +02:00
Avi Kivity
61d7f1f6a5 compound: upgrade iterator to be an std::forward_iterator
compound::iterator isn't far from a forward_iterator, and if we want
to use it with std::ranges, we have to upgrade it. This is because
std::ranges::subrange() only provides front() for forward ranges, and
we do use this front(). Boost apparently isn't as strict.

To make it a forward_range, we have to drop operator-> and make
operator* return a value (similar to std::views::tranform), since
forward iterators require that pointers and references be stable,
and this iterator returns a pointer to one of its members.

We also add an iterator_concept member to declare the compatibility
to std::ranges.
2024-10-28 17:16:36 +02:00
Kamil Braun
101c1d50f0 Merge 'fix nodetool status to show zero-token nodes' from Abhinav Kumar Jha
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id ” API  and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.

Fixes: scylladb/scylladb#19849
Fixes: scylladb/scylladb#17857

Closes scylladb/scylladb#20909

* github.com:scylladb/scylladb:
  fix nodetool status to show zero-token nodes
  test: move `wait_for_first_completed` to pylib/util.py
  token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
2024-10-28 12:19:36 +01:00
Kefu Chai
9f8adcd207 backup_task: track the first failure uploading sstables
before this change, we only record the exception returned
by `upload_file()`, and rethrow the exception. but the exception
thrown by `update_file()` not populated to its caller. instead, the
exceptional future is ignored on pupose -- we need to perform
the uploads in parallel.  this is why the task is not marked fail
even if some of the uploads performed by it fail.

in this change, we

- coroutinize `backup_task_impl::do_backup()`. strictly speaking,
  this is not necessary to populate the exception. but, in order
  to ensure that the possible exception is captured before the
  gate is closed, and to reduce the intentation, the teardown
  steps are performed explicitly.
- in addition to note down the exception in the logging message,
  we also store it in a local variable, which it rethrown
  before this function returns.

Fixes scylladb/scylladb#21248
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21254
2024-10-28 12:54:27 +03:00
Tzach Livyatan
1878af9399 Update os-support-info.rst - add CentOS
ScyllaDB support RHEL 9 and derivatives, including CentOS 9.

Fix https://github.com/scylladb/scylladb/issues/21309

Closes scylladb/scylladb#21310
2024-10-28 10:02:31 +02:00
Kefu Chai
8ac471b74b dht: do not include unused headers
in 8d1b3223, we removed some unused "#include"s, but we failed to
address all of them in "dht" subdirectory. and the unaddressed
"#include"s are identified by the iwyu workflow.

in this change, we address the leftovers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21291
2024-10-28 09:58:42 +02:00
Anna Stuchlik
44a807f5bc doc: improve the README file in the docs folder
This commit improves the README file so that it's more helpful
to documentation contributors. Especially, it:
- Adds the link to the prerequisites.
- Add information on troubleshooting (checking the links, headings, etc.)
- Removes the section on creating a knowledge base article, as we no longer
  promote adding KBs in favor of creating a coherent documentation set.

Fixes https://github.com/scylladb/scylladb/issues/21257

Closes scylladb/scylladb#21262
2024-10-28 09:55:40 +02:00
Anna Stuchlik
212eb204a7 doc: set 6.2 as the latest stable version
This commit updates the configuration for ScyllaDB documentation so that:
- 6.2 is the latest version.
- 6.2 is removed from the list of unstable versions.

It must be merged when ScyllaDB 6.2 is released.

In addition, this commit uncomments the redirections that should be applied
when version 6.2 is the latest stable version (which will happen when this commit
is merged).

No backport is required.

Closes scylladb/scylladb#21133
2024-10-28 09:45:37 +02:00
Pavel Emelyanov
420baf5035 api: Make get_highest_supported_sstable_version use proper service
This endpoint now grabs one via database -> table -> sstables manager
chain, but there's shorter route, namely via sstables format selector.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-28 10:18:57 +03:00
Pavel Emelyanov
61c8b571e5 api: Move system::get_highest_supported_sstable_version set/unset
It's currently registered with all other system endpoints and is not
unregistered. Its correct place is in the sstables-format-selector
set/unset functions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-28 10:18:23 +03:00
Pavel Emelyanov
f090bdabbb api: Scaffold for sstables-format-selector
This "service" will have its own endpoint soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-28 10:17:38 +03:00
Botond Dénes
31342ecb5d Merge 'tasks: fix virtual tasks children' from Aleksandra Martyniuk
Fix how regular tasks that have a virtual parent are created
in task_manager::module::make_task: set sequence number
of a task and subscribe to module's abort source.

Fixes: #21278.

Needs backport to 6.2

Closes scylladb/scylladb#21280

* github.com:scylladb/scylladb:
  tasks: fix sequence number assignment
  tasks: fix abort source subscription of virtual task's child
2024-10-28 08:59:40 +02:00
Aleksandra Martyniuk
85d9565158 test: repair: drop log checks from test_repair_succeeds_with_unitialized_bm
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.

Drop the error handling check, successful repair is a sufficient
passing condition.

Fixes: #21167.

Closes scylladb/scylladb#21208
2024-10-28 08:39:16 +02:00
Botond Dénes
416159e5d9 Merge 'docs/alternator: explain service discovery HTTP requests' from Nadav Har'El
Add a description of the service discovery HTTP requests - `/` and  `/localnodes` that was previously not documented except in a design document that is unfortunately no longer available publically (https://docs.google.com/document/d/1twgrs6IM1B10BswMBUNqm7bwu5HCm47LOYE-Hdhuu_8/edit).

Fixes https://github.com/scylladb/scylladb/issues/20989

Developer-oriented documentation so no need to backport.

Closes scylladb/scylladb#21000

* github.com:scylladb/scylladb:
  docs/alternator: explain service discovery HTTP requests
  docs/alternator: split Alternator-specific APIs from alternator.md
2024-10-28 08:21:28 +02:00
Benny Halevy
2268912589 docs: add documentation for scylla_identifier
Commit 3a12ad96c7
added an sstable_identifier uuid to the SSTable
scylla_metadata component, however it was
under-documented and this patch adds the missing
documentation for the sstable component format,
and to the scylla sstable tool documentation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21221
2024-10-28 08:18:08 +02:00
Kefu Chai
0f9d2ab577 build: cmake: disable Seastar exception hack
in cc3953e5, we disabled Seastar exception hack in configure.py.
this change disabled the Seastar exception hack in the following
two builds:

- build generated directly by configure.py
- build configured with multi-config generator using CMake

but we also have non-multi-config build using CMake. to be more
consistent, let's apply the equivalent change to non-multi-config
build of CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21233
2024-10-28 08:11:43 +02:00
Botond Dénes
be70755f47 Merge 'repair: Fix finished ranges metrics for removenode' from Asias He
The skipped ranges should be multiplied by the number of tables

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

Closes scylladb/scylladb#21252

* github.com:scylladb/scylladb:
  test: Add test_node_ops_metrics.py
  repair: Make the ranges more consistent in the log
  repair: Fix finished ranges metrics for removenode
2024-10-28 08:09:32 +02:00
Asias He
9868ccbac0 test: Add test_node_ops_metrics.py
It tests the node_ops_metrics_done metric reaches 100% when a node ops
is done.

Refs: #21174
2024-10-28 08:45:37 +08:00
Pavel Emelyanov
2f9f76fddf sstables_loader: Mark to_replica_set() private
It's not called from outside

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21210
2024-10-27 22:28:54 +02:00
Anna Stuchlik
ef4bcf8b3f doc: remove the Cassandra references from notedool
This PR removes the reference to Cassandra from the nodetool index,
as the native nodetool is no longer a fork.

In addition, it removes the Apache copyright.

Fixes https://github.com/scylladb/scylladb/issues/21238

Closes scylladb/scylladb#21240
2024-10-27 22:26:33 +02:00
Kefu Chai
e65185ba6f build: merge scylla's and seastar's compilation database
Since commit 415c83fa, Seastar is built as an external project. As a result,
the compile_commands.json file generated by ScyllaDB's CMake build system no
longer contains compilation rules for Seastar's object files. This limitation
prevents tools from performing static analysis using the complete dependency
tree of translation units.

This change merges Seastar's compilation database with ScyllaDB's and places
the combined database in the source root directory, maintaining backward
compatibility.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21234
2024-10-27 22:01:29 +02:00
Tomasz Grabiec
850d9cfb59 node-exporter: Disable hwmon collector
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.

By default, the monitoring stack queries it once every 60s.

Closes scylladb/scylladb#21165
2024-10-27 21:59:15 +02:00
Kefu Chai
f5b29331a2 build: populate --enable-dist --disable-dist to CMake
before this change, the "dist" targets are always enabled in the
CMake-based building system. but the build rules generated by
`configure.py` does respect `--enable-dist` and `--disable-dist`
command line options, and enable/distable the dist targets
respectively.

in this change, we

- add an CMake option named "Scylla_DIST". the "dist"
  subdirectory in CMake only if this option is ON.
- pouplate the `--enable-dist` and `--disable-dist` option
  down to cmake by setting the `Scylla_DIST` option,
  when creating the build system using CMake.

this enables the CMake-based build system to be functionality
wise more closer to the legacy building system.

Refs scylladb/scylladb#2717

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21253
2024-10-27 21:57:46 +02:00
Kefu Chai
24d14b601b treewide: s/boost::adaptors::map_values/std::views::values/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::values`.

in this change, we:

- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
  we cannot use `std::views::concat` yet. this helper is only
  available in C++26.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21265
2024-10-27 21:32:45 +02:00
Avi Kivity
3124711fc4 Merge 'Report rows_merged in compaction_history rest api and nodetool' from Łukasz Paszkowski
Currently, running the `nodetool compactionhistory` command or using the rest api `curl -X GET --header "Accept: application/json" "http://localhost:10000/compaction_manager/compaction_history"` return compaction history without the `row_merged` field.

The series computes rows merged during compaction and provides this information to users via both the nodetool command and the rest api. The `rows_merged` field contains information on merged clustering keys across multiple sstable files. For instance, compacting two sstables of a table consisting of 7 rows where two rows are part of the both sstables, the output would have the following format: {1: 5, 2: 2}.

No backport is required. It extends the existing compaction history output.

Fixes https://github.com/scylladb/scylladb/issues/666

Closes scylladb/scylladb#20481

* github.com:scylladb/scylladb:
  test/rest_api: Add tests for compactionhistory
  nodetool: Add rows merged stats into compactionhistory output
  compaction: Update compaction history with collected histogram
  compaction: Remove const qualifier from methods creating sstable readers
  sstable_set: Add optional statistics to make_local_shard_sstable_reader
  make_combined_reader: Add optional parameter, combined_reader_statistics
  reader_selector: Extend with maximum reader count
  mutation_fragment_merger: Create histogram while consuming mutation fragment batches
2024-10-27 21:26:11 +02:00
Kefu Chai
158008dd2c mutation_writer: simplify classification using with_deserialized() return value
Since `with_deserialized()` returns the lambda function's result, we can
directly return the bucket from within the lambda instead of relying on
side effects. This makes the code more explicit and functional.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21273
2024-10-27 21:20:55 +02:00
Nadav Har'El
6fdd0ebd3b RBAC: confirm that unprivileged users can't read the roles table
A worry was raised that an unprivileged user might be able to read the
system.roles table - which contains the Alternator secret keys (and also
CQL's hashed passwords). This patch adds tests that show that this worry
is unjustified - and acts as a regression test to ensure it never
becomes justified. The tests show that an unprivileged user cannot read
the system.roles table using either CQL or Alternator APIs.

More specifically, the two tests in this patch demonstrate that:

* The Alternator API does not allow an unprivileged user to read ANY system
  table, unless explicitly granted permissions for that table.

* The CQL API whitelists (see service::client_state::has_access) specific
  system tables - e.g., system_schema.tables - that are made readable to any
  unprivileged user. But the system.auth table is NOT whitelisted in this
  way - and is unreadable to unprivileged users unless explicitly granted
  permissions on that table.

The new tests passes on both Scylla and Casssandra.

Refs #5206 (that issue is about removing the Alternator secret keys from
the roles table - but stealing CQL salted hashes is still pretty bad, so
it's good to know that unprivileged users can't read them).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21215
2024-10-27 21:09:38 +02:00
Nadav Har'El
1634a64ffd cql-pytest: test a few small materialized views CQL issues
While documenting materialized view in a new document (Refs #16569)
I encountered a few questions on how various CQL operations work on
a table that has views, and this patch contains tests that clarify their
answer - and can later guarantee that the answer doesn't unintentionally
change in the future. The questions that these tests answer are:

1. That TRUNCATE on a base table also TRUNCATEs its views. This is just
   a basic test, with no attempt to reproduce issue #17635 (which is
   about the truncation of the base and views not being atomic).

2. That DROP TABLE is *not allowed* on a base table that has views.

3. That DROP KEYSPACE is allowed, even if there are tables with views.

4. Test that ALTER TABLE tbl DROP is never allowed in Cassandra, but
   allowed in some cases by Scylla

5. Test that ALTER TABLE tbl ADD is allowed, and "SELECT *" expands to
   select the new column into the materialized view as well.

All the new tests pass on both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21142
2024-10-27 21:08:28 +02:00
Botond Dénes
7c75fc599f streaming: stream-session: switch to tracking permit
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.

Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.

Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264

Closes scylladb/scylladb#21059
2024-10-27 20:01:25 +02:00
Avi Kivity
7ffbfe8bb3 Merge 'Squash some sstables::test helpers' from Pavel Emelyanov
There's a `missing_summary_first_last_sane` test case that uses some very specific way of modifying an sstable -- it loads one from resources, then tries to "write" the loaded stuff elsewhere. For that it uses a special purpose test::store() helper and a bunch of auxiliary ones from the same class. Those aux helpers are not used anywhere else and are also very special for this test case, so it make sense to keep this whole functionality in a single helper.

Closes scylladb/scylladb#21255

* github.com:scylladb/scylladb:
  test: Squash test::change_generation_number() into test::store()
  test: Squash test::change_dir() into test::store()
  test: Coroutinize sstables::test::store()
2024-10-27 19:59:59 +02:00
Anna Stuchlik
aa0dadea48 doc: extend the ToC for CDC
This commit adds the missing links to the CDC index page.

Fixes https://github.com/scylladb/scylladb/issues/21137

Closes scylladb/scylladb#21286
2024-10-27 19:57:59 +02:00
Anna Stuchlik
b2b9622e32 doc: fix redundant references to version 6.2
This commit removes mentions of version 6.2 that were introduced
with https://github.com/scylladb/scylladb/pull/17969.

Now that the documentation is versioned, there should be no reference to specific versions.

Fixes https://github.com/scylladb/scylladb/issues/21276

Closes scylladb/scylladb#21277
2024-10-27 14:47:40 +02:00
Paweł Zakrzewski
b077685fec test/cql-pytest: GROUP BY with static columns
This commit adds a new test case 'test_group_by_static_column_and_tombstones'
to verify the behavior of GROUP BY queries with static columns. The test is
adapted from Cassandra's test suite and aims to reproduce issue #21267.

Original, larger test:
cassandra_tests/validation/operations/select_group_by_test.py::testGroupByWithPaging()

Closes scylladb/scylladb#21270
2024-10-27 14:45:53 +02:00
Aleksandra Martyniuk
910a6fc032 tasks: fix sequence number assignment
Currently, children of virtual tasks do not have sequence number
assigned. Fix it.
2024-10-25 15:30:13 +02:00
Aleksandra Martyniuk
1eb47b0bbf tasks: fix abort source subscription of virtual task's child
Currently, if a regular task does not have a parent or its parent
is a virtual tasks then it subscribes to module's abort source
in task_manager::task::impl constructor. However, at this point
the kind of the task's parent isn't set. Due to that, children
of virtual tasks aren't aborted on shutdown.

Subscribe to module's abort source in task::impl::set_virtual_parent.
2024-10-25 14:18:00 +02:00
Kefu Chai
e7d6ab576b backup_task: remove unused member variable
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21258
2024-10-25 11:49:06 +03:00
Abhinav
c00d40b239 fix nodetool status to show zero-token nodes
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API  and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.

The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.

Fixes: scylladb/scylladb#19849
2024-10-25 13:28:09 +05:30
Abhinav
39dfd2d7ac test: move wait_for_first_completed to pylib/util.py
This function is needed in a new test added in the next commit and this
refactoring avoids code duplication.
2024-10-25 13:26:42 +05:30
Abhinav
72f3c95a63 token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.

This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.

The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.

Fixes: scylladb/scylladb#17857
2024-10-25 13:20:27 +05:30
Pavel Emelyanov
5e713b2b14 Merge 'dht: remove unused #includes ' from Kefu Chai
these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been
confirmed.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21237

* github.com:scylladb/scylladb:
  .github: add dht to iwyu's CLEANER_DIR
  dht: remove unused `#include`s
2024-10-24 18:40:49 +03:00
Pavel Emelyanov
7595ef7303 test: Squash test::change_generation_number() into test::store()
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-24 11:29:17 +03:00
Pavel Emelyanov
e885b0e6cd test: Squash test::change_dir() into test::store()
No other usages of the former helper other than immediatelly followed by
the latter, no point in keepint it around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-24 11:28:39 +03:00
Pavel Emelyanov
874cf2ea6f test: Coroutinize sstables::test::store()
Ahead of future changes

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-24 11:28:07 +03:00
Benny Halevy
5498018cbe data_dictionary: keyspace_metadata::describe: print tablets enabled also when defaulted
Now that tablets may be explicitly enabled when
creating a new keyspace, describe tablets as enabled
even when the default initial_tablets==0 is used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-24 10:18:42 +03:00
Benny Halevy
63cbb6e071 tablets_test: test enable/disable tablets when creating a new keyspace
Test both configuration values for `enable_tablets`
and the possibility to explicitly enable or disable
tablets, respectively, when creating a keyspace using the
`tablets = {'enabled': true|false}` CREATE KEYSPACE option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-24 10:18:42 +03:00
Benny Halevy
b0e12cb40d treewide: always allow tablets keyspaces
With the tablets feature always enabled (Unless gossip toopology
changes are forced), the enable_tablets option now controls only
the default for newly created keyspaces.

Even when set to `false`, tablets are still enabled as a
feature and the user may explicitly enable tablets
using `CREATE KEYSPACE <name> WITH tablets = {'enabled': true}`

Note: best viewed with `git show -w`

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-24 10:18:42 +03:00
Benny Halevy
bc62407421 feature_service: prevent enabling both tablets and gossip topology changes
Tablets require raft consistent topology changes.
Therefore, document that they are incompatible in
the config help and prevent their usage in
`feature_config_from_db_config`

Fixes scylladb/scylladb#21075

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-24 10:18:42 +03:00
Benny Halevy
9ef2dc2428 alternator: create_keyspace_metadata: enable tablets using feature_service
Rather than using the local configuration option
on this node, check the cluster feature instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-24 10:18:42 +03:00
Asias He
1392a6068d repair: Make the ranges more consistent in the log
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.
2024-10-24 10:31:15 +08:00
Asias He
cffe3dc49f repair: Fix finished ranges metrics for removenode
The skipped ranges should be multiplied by the number of tables.

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174
2024-10-24 10:31:15 +08:00
Kefu Chai
a9e18f70b0 Revert submodule change in 6ead5a4696
in 6ead5a46, we included submodule changes in cqlsh and java by accident.
this was not intended. and this broke the artifacts-rocky8-test.

in this change, both changes in the submodule are reverted.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21236
2024-10-23 19:49:20 +03:00
Pavel Emelyanov
9014da26e1 Merge 'docs: reference object storage config doc from nodetool commands ' from Kefu Chai
this series:

- promote object storage configuration to user-facing documentation
- reference object storage config doc from nodetool commands

---

the nodetool backup/restore commands are not included by any LTS branches yet, hence no need to backport.

Closes scylladb/scylladb#21071

* github.com:scylladb/scylladb:
  docs: move keyspace-storage-option from cql-extensions to admin
  docs: reference admin.rst for object storage config
  docs: reference object storage config doc from nodetool commands
  docs: promote object storage configuration to user-facing documentation
2024-10-23 19:41:46 +03:00
Michał Jadwiszczak
68d0c9a18a test/auth_cluster/test_raft_service_levels: match enterprise SL limit
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.

Fixes scylladb/scylladb#21044

Closes scylladb/scylladb#21045
2024-10-23 17:44:19 +02:00
Kefu Chai
bea18f0571 .github: add dht to iwyu's CLEANER_DIR
to avoid future violations of include-what-you-use.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-23 17:45:14 +08:00
Kefu Chai
8d1b3223ab dht: remove unused #includes
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-23 17:45:14 +08:00
Dawid Mędrek
298cafff35 cql-pytest/test_describe: Introduce auxiliary type for service levels
We introduce an auxiliary type representing a service level for making
it easier to adjust the tests in Enterprise. We move the responsibility
of producing create statements for service levels to the class, so we
only need to modify the code in one place when necessary.

All existing relevant tests have been adjusted to this change.

Closes scylladb/scylladb#21230
2024-10-23 10:15:25 +02:00
Kamil Braun
f5c60e538d Merge 'cql/tablets: fix retrying ALTER tablets KEYSPACE' from Piotr Smaron
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.

Fixes: scylladb/scylladb#21102

Should be backported to every 6.x branch, as it may lead to a crash.

Closes scylladb/scylladb#21121

* github.com:scylladb/scylladb:
  test: add UT to test retrying ALTER tablets KEYSPACE
  cql/tablets: fix indentation in `rf_change` event handler
  cql/tablets: fix retrying ALTER tablets KEYSPACE
2024-10-23 10:01:21 +02:00
Botond Dénes
519e167611 Merge 'replica/table: check memtable before discarding tombstone during read' from Lakshmi Narayanan Sreethar
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.

Fixes #20916

`perf-simple-query` stats before and after this fix :

`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59393 insns/op,   24029 cycles/op,        0 errors)
97551.14 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59376 insns/op,   23966 cycles/op,        0 errors)
96599.92 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59367 insns/op,   23998 cycles/op,        0 errors)
97774.91 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59370 insns/op,   23968 cycles/op,        0 errors)
97796.13 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59368 insns/op,   23947 cycles/op,        0 errors)

         throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
  cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19

// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59392 insns/op,   24058 cycles/op,        0 errors)
97311.48 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59375 insns/op,   24005 cycles/op,        0 errors)
98043.10 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59381 insns/op,   23941 cycles/op,        0 errors)
96750.31 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59396 insns/op,   24025 cycles/op,        0 errors)
93381.21 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59390 insns/op,   24097 cycles/op,        0 errors)

         throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
  cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```

This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.

Closes scylladb/scylladb#20985

* github.com:scylladb/scylladb:
  topology-custom: add test to verify tombstone gc in read path
  replica/table: check memtable before discarding tombstone during read
  compaction_group: track maximum timestamp across all sstables
2024-10-23 10:28:00 +03:00
Botond Dénes
d6a79fefda Merge 'Do not leak S3 file-uploading parts on exceptions' from Pavel Emelyanov
File uploading code spawns all parts uploading into background. If this "spawning" fails (not the uploading code itself), any fiber that was spawned before is orphaned. It will eventually stop on its own, by while it's alive it may use(-after-free) the do_upload_file object.

Another issue with not handling spawn exception, is that multipart upload object is not aborted in this case. So it's leaked until garbage collector picks it up, which is not critical, but unpleasant.

Closes scylladb/scylladb#21139

* github.com:scylladb/scylladb:
  s3/client: Restore indentation after previous patch
  s3/client: Catch do_upload_file::upload_part() exceptions
2024-10-23 10:12:29 +03:00
Ernest Zaslavsky
59e2ed884d Update seastar submodule
* seastar abd20efd...f821bda1 (17):
  > http: http status classification
  > loopback: add pending capacity param and fix deadlock in httpd_test
  > allow setting buffer sizes on server_socket
  > core: add missing assert header to chunked_fifo
  > cmake: Don't emit message when searching for libarchive
  > stall-analyser: pass args.tmin instead of tmin
  > build: do not check for CMAKE_CXX_STANDARD < 20
  > README.md: specify CMAKE_CXX_STANDARD in the sample
  > cmake: Fix DPDK libarchive dep
  > c-ares: update cooking version to 1.32.3
  > build: support c-ares >= 1.34.1
  > iotune: clarify fsqual error message
  > Make total_steal_time() monotonic.
  > Remove account_idle
  > reactor: add better sleep time accounting
  > reactor: add cpu and awake time reactor metrics
  > Zero-init total sleep time

Closes scylladb/scylladb#21225
2024-10-23 09:30:56 +03:00
Botond Dénes
b9b778054a Merge 'test.py: Add option to fail after number of failures' from Petr Hála
* Add `--max-failures` flag to test.py, which will stop the execution after number of failures
   * Helps with "fails-fast" approach and can be used to improve CI speed, especially the 100times run
   * Adds the number of cancelled tests to both summary and junit xml. I did not include them in boost, since it does not contain any statistics.
* Removes unnecessary list creation in test.py
   * Completely unrelated change, but it is small enough that I feel it can be included as part of this one. If this is an issue I can create separate PR for it

*  Add `Test.started` property
   * Helps with determining the current status of the Test and differentiating cancelled/not started tests.
* Add `Test.failed` and `Test.did_not_run` read-only computed properties
   * Helper methods to determine status, instead of using `Test.success`, which does not tell the entire story
* Fix `ScyllaClusterManager.stop()` method, so it doesn't fail when ran multiple times
   * This happens when tasks are cancelled, not sure yet why, it almost certainly non-wanted behaviour but this behaviour was already there and with this fix it no longer causes errors

I will use backport/None for now as it is a new feature.

Fixes https://github.com/scylladb/qa-tasks/issues/1714

Closes scylladb/scylladb#21098

* github.com:scylladb/scylladb:
  test.py: Add option to fail after number of failures
  test.py: Add started, failed and did_not_run properties to Test
  test.py: Remove unnecessary list creation
  test: lib: Fix ScyllaClusterManager.stop()
2024-10-23 09:11:52 +03:00
Kefu Chai
6a7eaea9f4 mutation_writer/feed_writer: remove redundant check
`mutation_reader::is_end_of_stream()` returns
`_impl->is_end_of_stream() && is_buffer_empty()`, so
`!is_end_of_stream()` equals to `
`!_impl->is_end_of_stream() || !is_buffer_empty()`, which in turn
always equals to
`!_impl->is_end_of_stream() || !is_buffer_empty() || !is_buffer_empty()`.
hence there is no need to check `rd.is_buffer_empty()` again.

in this change, the redundant condition is dropped. simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21224
2024-10-23 08:48:08 +03:00
Avi Kivity
cc3953e504 build: disable Seastar exception hack
In [1], Seastar started to bypass a lock in libgcc's exception throwing
mechanism to allow scalability on large machines. The problem is documented
in [2] and reported as fixed.

In [3], testing results on a 2s96c192t machine are reported. The problem
appears indeed fixed with gcc 14's runtime (which we use, even though we
build with clang).

Given the new results, we can safely drop the exception scalability hack.
As [1] states that the hack causes the loss of a translation cache, we
may gain some performance this way.

With that, we disable the cache by defining some random macro.

[1] https://github.com/scylladb/seastar/464f5e3ae43b366b05573018fc46321863bf2fae
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744
[3] https://github.com/scylladb/seastar/issues/2479#issuecomment-2427098413

Closes scylladb/scylladb#21217
2024-10-22 22:20:07 +03:00
Nadav Har'El
5fd3177057 Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.

This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload.

The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies

An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.

In the test above:
* without the fix, the latency of reads increases over 50s
* with the fix, the latency of reads stays below 20ms

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805

The patch is not that small and it isn't fixing a regression, so no backports

Closes scylladb/scylladb#20887

* github.com:scylladb/scylladb:
  test: add test for high view update concurrency causing bad_allocs
  test: add test for high view update concurrency degrading read latency
  mv: add a dedicated read concurrency semaphore for view update read before writes
2024-10-22 22:17:23 +03:00
Aleksandra Martyniuk
878a12c922 test: change quotation marks
Before python 3.12 formatted strings couldn't have reused quotes.
Change the type of quotation mark in get_cgroup so it could be
used with earlier python versions.

Closes scylladb/scylladb#21209
2024-10-22 20:42:05 +03:00
Piotr Smaron
522bede8ec test: add UT to test retrying ALTER tablets KEYSPACE
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
2024-10-22 18:22:01 +02:00
Piotr Smaron
3f4c8a30e3 cql/tablets: fix indentation in rf_change event handler
Just moved the code that previously was under a `for` loop by 1 tab, i.e. 4 spaces, to the left.
2024-10-22 18:22:01 +02:00
Piotr Smaron
de511f56ac cql/tablets: fix retrying ALTER tablets KEYSPACE
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.

Fixes: scylladb/scylladb#21102
2024-10-22 18:22:00 +02:00
Avi Kivity
ec543e3902 Merge 'Remove all_datadirs vector of strings from table::config' from Pavel Emelyanov
The all_datadirs keeps paths to directories where local sstables can be. In fact, Scylla doesn't put sstables there, but can try to find them on boot and when checking snapshots. The 0th element of this vector, called datadir, had recently been removed by #20675, now it's time to drop all_datadirs as well. The needed paths can be obtained from table's storage options (see #20542) and db::config::data_file_directories option.

Closes scylladb/scylladb#21212

* github.com:scylladb/scylladb:
  sstables: Open-code format_table_directory_name() moved recently
  replica,sstables: Move format_table_directory_name()
  table: Remove all_datadirs
  sstables: Generate table::all_datadirs from db::config and storage_options
  replica: Prepare vector of fs::path-s with table dirs
  table: Check storage options in get_snapshot_details()
2024-10-22 17:21:31 +03:00
Laszlo Ersek
63417f6a57 utils/small_vector: refactor expansion condition in reserve*()
Rewrite

  _begin + n > _capacity_end

as

  n > _capacity_end - _begin

and then as

  n > capacity()

for two reasons:

- The last form is easier to read than the first form.

- Per N4950 (the final C++23 working draft), [expr.add] paragraph 4, the
  expression

    _begin + n                            (i.e., P + J)

  is defined only if

    0 ≤ 0 + n ≤ _capacity_end - _begin    (i.e., 0 ≤ i + j ≤ n)

  equivalently, only if

    _begin ≤ _begin + n ≤ _capacity_end

  Therefore, the expression

    _begin + n

  invokes undefined behavior exactly when we'd expect our check

    _begin + n > _capacity_end

  to evaluate to true.

gcc and clang have been aggressively equating undefined behavior to "never
happens"; let's prevent that here.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#21213
2024-10-22 17:12:11 +03:00
Avi Kivity
847c850034 schema: add accessors for primary key columns and non-primary-key columns
It's somewhat common to ask for the partition key and clustering key
columns, or for the static and regular columsn. Provide accessors for them
rather than requiring the user to glue them.

Some callers are converted.

Closes scylladb/scylladb#21191
2024-10-22 15:01:14 +02:00
pehala
870f3b00fc test.py: Add option to fail after number of failures
Add --max-failures configuration option to specify the amount,
if not set, or not positive, it will never trigger.
Update also the junit reporting to include skipped tests
2024-10-22 13:29:34 +02:00
pehala
c1dd97a049 test.py: Add started, failed and did_not_run properties to Test
This ensures we can determine where in the execution pipeline
the test currently is.
failed and did_not_run are helper properties
2024-10-22 13:29:19 +02:00
pehala
e34dec71e7 test.py: Remove unnecessary list creation
Using generators & set constructor,
we can get rid of unnecessary list creation
2024-10-22 13:29:18 +02:00
pehala
16cd3fccdd test: lib: Fix ScyllaClusterManager.stop()
When cancelling running tasks, stop() could run multiple times and fail.
Removed usage of del and added checks to ensure it won't crash.
2024-10-22 13:29:18 +02:00
Kefu Chai
7a1e067b4e docs: move keyspace-storage-option from cql-extensions to admin
as the admin needs to known the name of the experimental feature option
they need to enable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 18:30:29 +08:00
Kefu Chai
6f97c86a2b docs: reference admin.rst for object storage config
instead of repeating it in cql-extensions.md, let's reference
the object storage related settings in admin.rst

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 18:26:19 +08:00
Kefu Chai
fe13b4e10e docs: reference object storage config doc from nodetool commands
Enhance the documentation for nodetool commands that use the `--endpoint`
option by linking to the object storage configuration guide. This change
provides users with essential context and detailed setup instructions
for S3-compatible storage endpoints.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 18:26:19 +08:00
Kefu Chai
9bd9ee9f36 docs: promote object storage configuration to user-facing documentation
this commit moves the object storage configuration guide from the developer
documentation to the user-facing admin documentation. the change reflects
the increasing importance of object storage integration in user-facing
features.

in this change:

- move relevant content from `docs/dev/object_storage.md` to
  `docs/operating-scylla/admin.rst`
- reformat the content from Markdown to reStructuredText (RST)
- reword and restructure the content to be more user-friendly
- add explanations and context suitable for a broader audience

this change makes the object storage configuration information more
accessible to Scylla administrators and end-users, supporting the adoption
of new features built on top of object storage integration.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 18:26:19 +08:00
Benny Halevy
04d741bcbb storage_service: on_change: update_peer_info only if peer info changed
Return an optional peer_info from get_peer_info_for_update
when the `app_state_map` arg does not change peer_info,
so that we can skip calling update_peer_info, if it didn't
change.

Fixes scylladb/scylladb#20991
Refs scylladb/scylladb#16376

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21152
2024-10-22 10:26:08 +02:00
Dawid Medrek
4ec0a014e3 docs/hinted-handoff: Add link to API reference
We add a link to the API reference for the convenience
of the user.

Closes scylladb/scylladb#20065
2024-10-22 09:24:14 +03:00
pehala
28aa57f836 test.py: Refactor retry()
Instead of metamethod that looks at all subclasses, use OOP with super() calls

Closes scylladb/scylladb#21155
2024-10-22 09:23:30 +03:00
David Garcia
6b7b4addf9 docs: add dark theme to api
Closes scylladb/scylladb#21161
2024-10-22 09:22:32 +03:00
pehala
59eb4eb528 test.py: Enhance progress report
* Do not leave passed tests in between failed ones.
* Use ANSI Escape sequences for manipulating console
  * Simplifies code and removes need for two object parameters

Closes scylladb/scylladb#21176
2024-10-22 09:22:08 +03:00
Łukasz Paszkowski
34c05cb94f test/rest_api: Add tests for compactionhistory
For a table with NullCompactionStrategy and
TimeWindowCompactionStrategy, the test
- inserts a bunch of data and flushes the table
- deletes/update some data, delete a range of data and flushes
  the table
- Triggers a major compaction and calls for compactionhistory
  to retrieve and validate the histogram
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
8188a71787 nodetool: Add rows merged stats into compactionhistory output
Incorporate rows merged statistics into the output of the compactionhistory
command. Depending on the requested format type, the output has
different form.

For instance, compacting two sstables of a table consisting of 7
rows where two rows are part of the both sstables, the output
would have the following format:
text: {1: 5, 2: 2}
json: [{"key":1,"value":5},{"key":2,"value":1}]}
yaml: - key: 1
        value: 5
      - key: 2
        value: 1
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
c01a38f3cf compaction: Update compaction history with collected histogram
A new field has been added to the compaction_stats structure to hold
collected combined reader statistics. The struct is than used to update
the compaction_history table.
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
7eac89da73 compaction: Remove const qualifier from methods creating sstable readers
Compaction classes start mutate their internal members to be used
in methods setup_sstable_reader and make_sstable_reader creating
sstable reades that are marked as const.

Remove the const qualifier from these methods. Even though it made
sense initially to mark them as const, it is no longer applicable.
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
484655bf0d sstable_set: Add optional statistics to make_local_shard_sstable_reader
The pointer to combined_reader_statistics is propagated down to
make_combined_reader in order to collect statistics. By default,
a null pointer is propagated.

Note that in case the pointer is valid and the sstable_set consists
of exactly one sstable, statistics are skipped as all rows originate
from exactly a single sstable file.

The existing optimization is crucial f75154afca
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
a9f776494c make_combined_reader: Add optional parameter, combined_reader_statistics
All the overloaded make_combined_reader functions accept an optional
pointer to combined_reader_statistics, to be propagated down through
merging_reader to mutation_fragment_merger. By default, a null pointer
is propagated.
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
84912c3155 reader_selector: Extend with maximum reader count
The maximum reader count allows to predict the number of readers
that can be created with create_new_readers(). This helps to
correctly allocate a vector size in the rows_merged statistics
when a combiner reader is created via make_combined_reader.
2024-10-22 08:15:02 +02:00
Łukasz Paszkowski
92f5c56afc mutation_fragment_merger: Create histogram while consuming mutation fragment batches
The mutation_fragment_merger takes one additional parameter in its
constructor, that is a pointer to a combined_reader_statistics
used to collect various statistics.

The histogram is populated with data while the merger consumes
batches from the producer and merges them into seperate mutation
fragments. The size of the batch, that represents the number of streams
the mutation fragment originates from, is used as a key in the
historgam and its corresponding value is increased by one.
2024-10-22 08:15:02 +02:00
Botond Dénes
41de340d93 Merge 'Update get_description.py script' from Amnon Heiman
get_description.py script is a document related script that looks for metrics description in the code.
Its configuration needs to address changes in the code.

This series contains a configuration change and a code fix that allows it to run as a standalone script, and not as a library.

No need to backport, this a documentation related script.

Closes scylladb/scylladb#19950

* github.com:scylladb/scylladb:
  scripts/get_description.py: param_mapping was missing
  scripts/metrics-config.yml: no need to get metrics from the tests
2024-10-22 08:42:15 +03:00
Kefu Chai
27fb893d9b docs: nodetools-commands/restore: update to reflect the latest implementation
in 787ea4b1d4, we added "sstables" argument to the "nodetool restore"
command. but we failed to update the document to reflect the change.

in this change, we update the document for "restore" command to reflect
the latest implementation changes introduced in commit 787ea4b1d4:

* Add information about the new "sstables" argument
* Update command line usage of "--table" argument -- it is now madatory
* Update the example accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21135
2024-10-22 08:30:06 +03:00
Kefu Chai
ce0a86c585 build: cmake: correct some tests' KIND
before this change, we build some tests as if they are Seastar tests.
but after 415c83fa, these tests failed to link. because the
Seastar::seastar_testing does not expose `-DSEASTAR_TESTING_MAIN` in
its cflags. the behavior of the Seastar::seastar_testing  is expected.
because a test linking against this library is not necessarily driven
by the `main()` provided by `testing/seastar_test.hh`.

so, in this change, we correct the `KIND` parameter of these tests,
so that they use `KIND BOOST`, as these tests can be driven by the
`main()` provided by Boost.Test's driver. also there are some tests
driven by Boost.Test's `main()`, but in the meanwhile, they utilize
seastar_testing, so let's add `Seastar::seastar_testing` to their
`LIBRARIES`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21183
2024-10-22 07:10:47 +03:00
Kefu Chai
6ead5a4696 treewide: move log.hh into utils/log.hh
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.

in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 06:54:46 +03:00
Kefu Chai
6645cdf3b6 build: cmake: improve source generated for check_header
in check_headers.cmake, we verify the self containness of a header file
by replicating it and remove `#pragma once` directive in this header.
but this approach failed to compile headers which include a header file
with the same name in the root source directory, as we add
`-I<directory-of-original-header>` in the cflags when building the
generated source file, so that it can include the headers in the same
directory. but this confuses the compiler, as, assuming we have "log.hh"
in current directory, and under the root source directory, the compiler
would always include the "log.hh" in the current directory even it
should have included "log.hh" under the root source directory.

in this change, instead of adding `-I<directory-of-original-header>` to
cflags, we just include the header under test in a new .cc file
solely generated for testing. this should address this problem.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21216
2024-10-22 06:28:16 +03:00
Kefu Chai
2d6af2791e compaction: simplify time_window_compaction_strategy::get_window_lower_bound()
since chrono allows dividion between durations with different units. let
use it instead for rounding down to the nearest multiple of the window
size, for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20476
2024-10-21 16:01:15 +03:00
Pavel Emelyanov
516a5f06a8 sstables: Open-code format_table_directory_name() moved recently
This helper is small enough and it's easier to understand how table
directory name is formatted without it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:18:19 +03:00
Pavel Emelyanov
eeb0d637bb replica,sstables: Move format_table_directory_name()
Now this helper is not needed in replica code, as all manipulations of
tables' sstables now sit in the sstables/storage.cc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:17:30 +03:00
Pavel Emelyanov
74728d3889 table: Remove all_datadirs
It's write-only now, all the places than wanted to know where table's
storage is (well -- "are", there can be several directories) already use
storage_options.

This finishes the work started by 9fe64b5d70.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:15:54 +03:00
Pavel Emelyanov
dedb9d349c sstables: Generate table::all_datadirs from db::config and storage_options
As mentioned in the previous patch, there are several places that need
to scan all datafile directories for a given table. This list is
currently stored on table.config.all_datadirs, this patch stops using
one and instead generates it from db::config::data_file_directories and
table's storage options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:13:27 +03:00
Pavel Emelyanov
0358515118 replica: Prepare vector of fs::path-s with table dirs
Most of the time table with local storage keeps its sstables in a single
directory referenced by its storage_options::local.dir path. However,
there are two cases when code needs to check all datafile directories
that could be configured -- on boot when distributed loader loads
sstables, and when checking table snapshots.

Both those places check table.cfg.all_datadirs vector of strings and
convert strings to fs::path-s along the way. This patch prepares the
vector of fs::path-s in advance and updates the loop code to work with
path-s.

This is preparation to next patching that will generate vector of paths
for a table.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:10:53 +03:00
Pavel Emelyanov
4e329ba08f table: Check storage options in get_snapshot_details()
This is continuation of 24589cf00c and a734fd5c9c -- if table is not
based on local storage, getting snapshot details makes no sense.

Another goal this change pursuits is to have storage_options::local
object at hand to be used later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-21 15:08:54 +03:00
Wojciech Mitros
4d719bacca test: add test for high view update concurrency causing bad_allocs
This commit add a test for checking whether a large view update workload
can cause Scylla to run out of memory.
In the test, we keep writing to a table table with a materialized view
with a limited number of rows, causing overwrites which require reading
from the table to perform view updates.
Currently, due to the unlimited concurrency of view update reads, we may
use too much memory which can lead to bad_allocs, causing Scylla to fail.
To reach the failing state more consistently, we use add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use high concurrency and
large row size to use up all Scylla's memory quickly.
The test fails if Scylla runs out of memory and aborts, and succeeds
otherwise.
2024-10-21 12:35:20 +02:00
Wojciech Mitros
f2c740710c test: add test for high view update concurrency degrading read latency
This commit add a test for checking whether a large view update workload
impacts the latency of other user reads.
In the test, we first create a table for reads and another table with
a materialized view. We then start writing to the table with the view
with a limited number of rows - when overwriting, we need to read the
previous value of the row to prepare a delete of the old row in the view.
This should not impact the latency of the read workload from the other
table that we start at the same time. The test fails if any of the reads
times out.
To reach the failing state more consistantly, we use add a sleep after
reading the old value of the base row, to keep the reader concurrency
semaphore units longer. At the same time, we use a lower threshold for
queueing reads on the semaphore, to see the impact of view update reads
earlier.
Because of the high load, the writes may timeout, but that's expected
- we fail the test only if the user reads time out.
2024-10-21 12:34:55 +02:00
Kefu Chai
5cd619a60c treewide: s/boost::adaptors::map_keys/std::views::keys/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.

in this change, we:

- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21198
2024-10-21 12:47:52 +03:00
Wojciech Mitros
242079d70b mv: add a dedicated read concurrency semaphore for view update read before writes
When writing to some tables with materialized views, we need to read from the base
table first to perform a delete of the old view row. When doing so, the memory used
for the read is tracked by the user read concurrency semaphore. When we have a large
number of such reads, we may use up all of the semaphore units, causing the following
reads to be queued. When we have some user reads coming at the same time, these reads
can have very high latency due to the write workload on the base table. We want to avoid
this, so that the write workload doesn't have a high impact on the latency of the
read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for
view update read-before-writes. With the new semaphore, even if there are many view
update read-before-writes, they will be queued on a different semaphore than the user
reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of the view updates that is
currently unlimited. Because of that view updates may take up so much memory that
they we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier,
but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semahpore
and is currently used only for user writes - is't used independently of the scheduling
group on which we base the read semaphore selection, but we use a different code path
for streaming (not database::do_apply) and we shouldn't have view updates in system
writes or during compaction.

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805
2024-10-21 11:02:06 +02:00
Kefu Chai
5255f18c35 date: do not put space before literal operator
when compiling date.h, clang 20 complains:
```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=c++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT lang/CMakeFiles/lang.dir/Debug/lua.cc.o -MF lang/CMakeFiles/lang.dir/Debug/lua.cc.o.d -o lang/CMakeFiles/lang.dir/Debug/lua.cc.o -c /home/kefu/dev/scylladb/lang/lua.cc
In file included from /home/kefu/dev/scylladb/lang/lua.cc:18:
/home/kefu/dev/scylladb/utils/date.h:836:34: error: identifier '_d' preceded by whitespace in a literal operator declaration is deprecated [-Werror,-Wdeprecated-literal-operator]
  836 | CONSTCD11 date::day  operator "" _d(unsigned long long d) NOEXCEPT;
      |                      ~~~~~~~~~~~~^~
      |                      operator""_d
```

because, in
[CWG2521](https://wg21.link/CWG2521), it proposes that compiler should
consider
```c++
  string operator "" _i18n(const char*, std::size_t); // OK, deprecated
```
as "OK, deprecated".

and Clang implemented this proposal, as it was accepted by C++23. since
scylladb uses C++23 standard. let's remove the space between `"` and
`_` to be more compliant to the C++23 standard and to silence the
warning, which is taken as an error.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21194
2024-10-21 11:21:52 +03:00
Kefu Chai
8355056453 build: cmake: expose and use the path to iotune correctly
in 415c83fa, we introduced a regression which broke the build of
target of "package". because

- the IMPORT_LOCATION_<CONFIG> of the imported target of
  "Seastar::iotune" includes a literal `$<CONFIG>`
- we retrieve the property named "IMPORTED_LOCATION" from
  this target. but value of this property is empty.

so, when we copied this file, the "src" parameter passed to
`cmake -E copy` is actually an empty string.

in this change, we

- set the `IMPORTED_LOCATION_${CONFIG}` property with a
  correct path.
- retrieve the property with the right approach -- to use
  `TARGET_FILE` generator expression.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21181
2024-10-21 10:32:51 +03:00
Avi Kivity
b5a1173880 utils: small_vector: support from_range_t
std::ranges::to<>() has a little protocol with containers to
allow them to optimize their construction from ranges. Implement it
for small_vector. It optimizes ranges that can have their size determined
quickly, or that can be traversed twice to determine the size by reserving
up front. Single-pass ranges (std::ranges::input_range) use the less
efficient push_back method.

A unit test (which fails without the new constructor) is added.

Closes scylladb/scylladb#21094
2024-10-21 09:31:38 +03:00
Kefu Chai
d28d64f7fe service: remove extraneous space in #pragma once
to be more consistent with the rest of the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21188
2024-10-20 20:27:38 +03:00
Avi Kivity
c3be2489ce treewide: drop includes of <boost/range/adaptors.hpp>
This includes way too much, including <boost/regex.hpp>, which is huge.
Drop includes of adaptors.hpp and replace by what is needed.

Closes scylladb/scylladb#21187
2024-10-20 17:17:11 +03:00
Aleksandra Martyniuk
29c2d4e7eb tasks: add comments about map_each_task safety
Closes scylladb/scylladb#21172
2024-10-19 21:16:38 +03:00
Avi Kivity
9a521c25b5 Merge 'test/boost: stop using ranges::to()' from Kefu Chai
now that we are able to use ranges library provided by the C++ standard library. there is no need to use the homebrew `ranges::to()`.

in this series, we

- switch to `std::ranges::to()` in favor of `ranges::to()`.
- and drop the unused `utils/ranges.hh` header file.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#21182

* github.com:scylladb/scylladb:
  utils: remove unused ranges.hh
  test/boost: stop using ranges::to()
2024-10-19 16:57:51 +03:00
Kefu Chai
c5e666b7b1 column_computation.hh: include used header
when building the check-header target, we have following failure:

```
FAILED: CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o
/home/kefu/.local/bin/clang++ -DDEVEL -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Dev/seastar/gen/include -isystem /usr/include/p11-kit-1 -isystem /home/kefu/dev/scylladb/abseil -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Wno-unused-const-variable -Wno-unused-function -Wno-unused-variable -std=c++23 -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DFMT_SHARED -DWITH_GZFILEOP -MD -MT CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o -MF CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o.d -o CMakeFiles/check-headers-scylla-main.dir/Dev/check-headers/column_computation.hh.cc.o -c /home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc
/home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc:24:37: error: no template named 'unique_ptr' in namespace 'std'
   24 | using column_computation_ptr = std::unique_ptr<column_computation>;
      |                                ~~~~~^
/home/kefu/dev/scylladb/build/check-headers/column_computation.hh.cc:40:12: error: unknown type name 'column_computation_ptr'; did you mean 'column_computation'?
   40 |     static column_computation_ptr deserialize(bytes_view raw);
      |            ^~~~~~~~~~~~~~~~~~~~~~
      |            column_computation
```

it turns out we failed to include `<memory>`.

in this change, we include `<memory>` so that this header is
self-contained.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21185
2024-10-19 16:56:02 +03:00
Raphael S. Carvalho
dfc217f99a locator: Always preserve balancing_enabled in tablet_metadata::copy()
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.

The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.

Fixes #21175.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21177
2024-10-19 14:51:36 +02:00
Kefu Chai
4d4b0b35b7 utils: remove unused ranges.hh
now that this header is not used, let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-19 13:21:20 +08:00
Kefu Chai
85518463a9 test/boost: stop using ranges::to()
now that we are able to use ranges library provided by the C++
standard library. there is no need to use the homebrew `ranges::to()`.

in this change, we switch to `std::ranges::to()` in favor of
`ranges::to()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-19 13:21:20 +08:00
Kefu Chai
5c0db8a49e sstable_directory: remove extraneous semicolon
one semicolon is enough to mark the end of a statement. so let's
remove the extraneous one.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21171
2024-10-18 21:58:04 +03:00
Kefu Chai
e2b18eb7eb data_dictionary: compose the location with "/"
in 787ea4b1, we construct a new `storage_options` for each sstable
to be restored. the `location` of the new `storage_option` instances
is composed of the configured `prefix` and the dirname of each toc
component. but instead of separating them with "/", we just concatenate
them. this breaks the test if the specified key representing toc
components includes "dirname" in them.

in this change

- data_directory: instead of using "{prefix}{dirname}", we use
  "{prefix}/{dirname}".
- test/object_store: update the existing test to add a suffix
  in the keys of the toc objects to mimic the typical use case.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21170
2024-10-18 21:57:56 +03:00
Lakshmi Narayanan Sreethar
afad1b3c85 topology-custom: add test to verify tombstone gc in read path
Co-authored-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-10-18 19:20:03 +05:30
Lakshmi Narayanan Sreethar
5a93277904 replica/table: check memtable before discarding tombstone during read
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.

Fixes #20916

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-10-18 19:19:58 +05:30
Lakshmi Narayanan Sreethar
6a357b55e3 compaction_group: track maximum timestamp across all sstables
This will be used in a following patch to decide if the compacting
reader has to check the memtables before purging a tombstone.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-10-18 19:19:11 +05:30
Pavel Emelyanov
b11d50f591 Merge 'multishard reader: make it safe to create with admitted permits' from Botond Dénes
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will  happen.

Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.

After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.

Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)

This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2

Closes scylladb/scylladb#21058

* github.com:scylladb/scylladb:
  test/boost/mutation_test: add test for multishard permit safety
  test/lib/reader_lifecycle_policy: add semaphore factory to constructor
  test/lib/reader_lifecycle_policy: rename factory_function
  repair/row_level: drop now unneeded release_base_resource() calls
  readers/multishard: make multishard reader safe to create with admitted permits
2024-10-18 13:30:21 +03:00
Pavel Emelyanov
280cd23c13 Merge 'Allow specifying TLS options with internode_encryption=none + add "transitional" mode' from Calle Wilund
Fixes #18903

Adds a "transitional" internode encryption mode, under which all _outgoing_ RPC connections will use TLS, but we will still accept any incoming non-tls connection.

This allows an operator to perform a move to TLS RPC without cluster downtime:

1. For each server, add certificate etc options to server_encryption_options + internode_encryption=none + set ssl_storage_port + restart (rolling)

2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR

Closes scylladb/scylladb#18939

* github.com:scylladb/scylladb:
  test::topology: Add test for TLS upgrade and downgrade of internode encryption
  docs: Add internode_encryption=transitional documentation
  messaging_service: Add "transitional" internode encryptipn mode
  messaging_service: Create TLS connector even if internode_enc=none when certs set
2024-10-18 11:01:07 +03:00
Avi Kivity
1bbd1436b4 types: move from boost ranges to standard ranges
Reduce depdendency load.

tuple_deserializing_iterator gained a default constructor so it
matches iterator constraints.

Closes scylladb/scylladb#21029
2024-10-18 11:00:49 +03:00
Botond Dénes
b6da82dba3 Merge 'build: build seastar as an external project' from Kefu Chai
before this change, scylla's CMake-based system consumes Seastar
library by including it directly. but this failed to address the needs
of linking against Seastar shared libraries in Debug and Dev builds, while
linking against the static libraries in other builds. because Seastar
uses `BUILD_SHARED_LIBS` CMake variable to determine if it builds
shared libraries. and we cannot assign different values to this
CMake variable based on current configure type -- CMake does not
support. see https://gitlab.kitware.com/cmake/cmake/-/issues/19467

in order to address this problem, we have a couple possible
solutions:

- to enable Seastar to build both shared and static libraries in a
  pass. without sacrificing the performance, we have to build
  all object files twice: once with -fPIC, once without. in order
  to accompolish this goal, we need to develop a machinary to
  populate the same settings to these two builds. this would
  complicate the design of Seastar's building system further.
- to build Seastar libraries twice in scylla, we could use
  the ExternalProject module to implement this. but it'd be
  complicate to extract the compile options, and link options
  previously populated by Seastar's targets with CMake --
  we would have to replicate all of them in scylla. this is
  out of the question.
- to build Seastar libraries twice before building scylla,
  and let scylla to consume them using CMake config files or
  .pc files. this is a compromise. it enables scylla to
  drive the build of Seastar libraries and to consume
  the compile options and link options. the downside is:

  * the generated compilation database (compile_commands.json)
    does not include the commands building Seastar anymore.
  * the building system of scylla does not have finer graind
    control on the building process of seastar. for instance,
    we cannot specify the build dependency to a certain seastar
    library, and just build it instead of building the whole
    seastar project.

turns out the last approach is the best one we can have
at this moment. this is also the approach used by the existing
`configure.py`.

in this change, we

- add FindSeastar.cmake to

  * detect the preconfigured Seastar builds, and
  * extract the build options from .pc files
  * expose library targets to be consumed by parent project
- add Seastar as an external project, so we can build it from
  the parent project.

  this is atypical compared to standard ExternalProject usage:
  - Seastar's build system should already be configured at this point.
  - We maintain separate project variants for each configuration type.

  Benefits of this approach:
  - Allows the parent project to consume the compile options exposed by
    .pc file. as the compile options vary from one config to another.
  - Allows application of config-specific settings
  - Enables building Seastar within the parent project's build system
  - Facilitates linking of artifacts with the external project target,
    establishing proper dependencies between them

we will update `configure.py` to merge the compilation database
of scylla and seastar.

Refs scylladb/scylladb#2717

---

this is a CMake-related change, hence no need to backport.

Closes scylladb/scylladb#21131

* github.com:scylladb/scylladb:
  build: cmake: use GENERATOR_IS_MULTI_CONFIG property to detect mult-config
  build: cmake: consume Seastar using its .pc files
  build: do not use `mode` as the index into `modes`
  build: cmake: detect and link against GnuTLS library
  build: cmake: detect and link against yaml-cpp
  build: cmake: link Seastar with Seastar::<COMPONENT>
  build: cmake: define CMake generate helper funcs in scylla
2024-10-18 09:42:59 +03:00
Amnon Heiman
09fa625672 scripts/get_description.py: param_mapping was missing
get_description.py was moved from a standalone script to a library.
During the transition, param_mapping was not included in the script
option.

This patch makes it possible to use the file as a standalone script
again.
2024-10-18 08:58:04 +03:00
Amnon Heiman
10af854ec4 scripts/metrics-config.yml: no need to get metrics from the tests 2024-10-18 08:57:53 +03:00
Kefu Chai
b5f5a963ca build: do not pass Seastar_CXX_DIALECT=gnu++23 when building Seastar
Seastar now respect CMAKE_CXX_STANDARD in favor of Seastar_CXX_DIALECT,
which has been dropped in Seastar's commit of
60bc8603bd438232614e9b3dcd7537dc83c85206 .

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21130
2024-10-18 08:57:23 +03:00
Botond Dénes
6811411288 Merge 'Sanitize commitlog API endpoints' from Pavel Emelyanov
Endpoints are registered next to the service they use, and the unregistration deferred action is created right after it. When registered, the service in question is passed as argument and then captured by enpoints lambdas. This makes sure that service is not used by endpoints after being stopped.

That's not so for commitlog endpoints. These are registered in several places, and /commitlog "function" is not unregistered on stop. This patch fixes some of this misbehavior, in particular:

 -  adds unregistration of commitlog API function
 -  uses sharded<database>& argument in endpoints instead of ctx.db
 -  moves some endpoints from storage_service.cc to commitlog.cc

Closes scylladb/scylladb#21053

* github.com:scylladb/scylladb:
  api: Use captured database, not the one from ctx
  api: Pass sharded<database> to commitlog endpoints registration
  api: Move commitlog-related from storage_service.cc
  api: Unset commitlog API endpoints
  api: Extract set_server_commitlog() from set_server_done()
2024-10-18 08:56:13 +03:00
Botond Dénes
568b767ec3 Merge 'schema: convert from boost ranges to std ranges' from Avi Kivity
To reduce dependency load, change uses of boost ranges to std::ranges.

The first patch is preparation, replacing a construct that isn't easy to support with std ranges with something simpler.

No backport as this is a code cleanup.

Closes scylladb/scylladb#21122

* github.com:scylladb/scylladb:
  schema: replace boost ranges with std ranges
  schema: precompute all_columns_in_select_order()
2024-10-18 08:42:50 +03:00
Pavel Emelyanov
df6991edd3 test: Do not duplicate sstable twice
The statistics_rewrite test case copies an sstable from resources two
times:

- first time -- explicitly by listing resource components and copying
  files to the test temp dir
- second time -- implicitly, by calling create_links() linking copied
  files by new set in the staging/ subdirectory

The 2nd step is not needed and the history of changes justifies that.

The test itself appeared with 70b793e4d3 and it only contained the 2nd
"copying" -- test linked files from resource directory and then worked
in the newly created set.

Later, commit 59c57861ae added the first step and copied the files
from resource into test temp dir. At this point linking copied files
because pointless, but was preserved. Let's remove it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21097
2024-10-18 08:31:08 +03:00
Kefu Chai
26a5a00b20 interval: include used header
when building the tree with Clang-20 and libstdc++ shippped with
GCC-14.2, we have following build failure:

```
/home/kefu/dev/scylladb/interval.hh:638:14: error: no member named 'sort' in namespace 'std'
  638 |         std::sort(intervals.begin(), intervals.end(), [&](auto&& r1, auto&& r2) {
      |         ~~~~~^
/home/kefu/dev/scylladb/interval.hh:691:21: error: no member named 'upper_bound' in namespace 'std'
  691 |         return std::upper_bound(r.begin(), r.end(), value, std::forward<LessComparator>(cmp));
      |                ~~~~~^
/home/kefu/dev/scylladb/interval.hh:723:18: error: no member named 'minmax' in namespace 'std'; did you mean 'fminmag'?
  723 |         auto p = std::minmax(_interval, other._interval, [&cmp] (auto&& a, auto&& b) {
      |                  ^~~~~~~~~~~
      |                  fminmag
```

it turns out we failed to include the used header.

in this change, we include `<algorithm>` so that this header is
self-contained.

after this change, the build passes.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21168
2024-10-18 08:26:27 +03:00
Kefu Chai
e73b0c942f build: cmake: use GENERATOR_IS_MULTI_CONFIG property to detect mult-config
this is more reliable way to check if we are configured to use a
mult-config generator.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
415c83fa67 build: cmake: consume Seastar using its .pc files
before this change, scylla's CMake-based system consumes Seastar
library by including it directly. but this failed to address the needs
of linking against Seastar shared libraries in Debug and Dev builds, while
linking against the static libraries in other builds. because Seastar
uses `BUILD_SHARED_LIBS` CMake variable to determine if it builds
shared libraries. and we cannot assign different values to this
CMake variable based on current configure type -- CMake does not
support. see https://gitlab.kitware.com/cmake/cmake/-/issues/19467

in order to address this problem, we have a couple possible
solutions:

- to enable Seastar to build both shared and static libraries in a
  pass. without sacrificing the performance, we have to build
  all object files twice: once with -fPIC, once without. in order
  to accompolish this goal, we need to develop a machinary to
  populate the same settings to these two builds. this would
  complicate the design of Seastar's building system further.
- to build Seastar libraries twice in scylla, we could use
  the ExternalProject module to implement this. but it'd be
  complicate to extract the compile options, and link options
  previously populated by Seastar's targets with CMake --
  we would have to replicate all of them in scylla. this is
  out of the question.
- to build Seastar libraries twice before building scylla,
  and let scylla to consume them using CMake config files or
  .pc files. this is a compromise. it enables scylla to
  drive the build of Seastar libraries and to consume
  the compile options and link options. the downside is:

  * the generated compilation database (compile_commands.json)
    does not include the commands building Seastar anymore.
  * the building system of scylla does not have finer graind
    control on the building process of seastar. for instance,
    we cannot specify the build dependency to a certain seastar
    library, and just build it instead of building the whole
    seastar project.

turns out the last approach is the best one we can have
at this moment. this is also the approach used by the existing
`configure.py`.

in this change, we

- add FindSeastar.cmake to

  * detect the preconfigured Seastar builds, and
  * extract the build options from .pc files
  * expose library targets to be consumed by parent project
- add Seastar as an external project, so we can build it from
  the parent project. BUILD_AWAYS is set to ensure that Seastar is
  rebuilt, as scylla developers are expected to modify Seastar
  occasionally. since the change in Seastar's SOURCE_DIR is not
  detectable via the ExternalProject, we have to rebuild it.

  this is atypical compared to standard ExternalProject usage:
  - Seastar's build system should already be configured at this point.
  - We maintain separate project variants for each configuration type.

  Benefits of this approach:
  - Allows the parent project to consume the compile options exposed by
    .pc file. as the compile options vary from one config to another.
  - Allows application of config-specific settings
  - Enables building Seastar within the parent project's build system
  - Facilitates linking of artifacts with the external project target,
    establishing proper dependencies between them
- preserve the existing machinery of including Seastar only when
  building without multi-config generator. this allows users who don't
  use mult-config generator to build Seastar in-the-tree. the typical
  use case is the CI workflows performing the static analysis.

we will update `configure.py` to merge the compilation database
of scylla and seastar.

Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
7cb74df323 build: do not use mode as the index into modes
before this change, in `configure_seastar()`, we use `mode` as
a component in the build directory, and use it as the index into `modes`
dict. but in a succeeding commit, we will reuse `configure_seastar()`
when preparing for the CMake-based building system, in which,
`mode` will be the CMake configure type, like "Debug" instead of
scylla's build mode, like "debug". to be prepared for this change,
let's use `mode_config` directly. it's identical to `modes[mode]`.

this also improves the readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
1bd2ed7826 build: cmake: detect and link against GnuTLS library
before this change, in the CMake-based building system, we rely on
Seastar to provide this linkage, but this is wrong and fragile. as
Seastar is not supposed to expose and provide GnuTLS symbols. that's
why we have following build failure:

```
: && /home/kefu/.local/bin/clang++ -g -Og -g -gz -Xlinker --build-id=sha1 --ld-path=ld.lld -dynamic-linker=/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2 /home/kefu/dev/scylladb/build/Debug/seastar/libseastar.so -fsanitize=address -fsanitize=undefined /usr/lib64/libboost_program_options.so /usr/lib64/libboost_thread.so /usr/lib64/libcares.so /usr/lib64/libfmt.so.11.0.2 -L/usr/lib64 -llz4 CMakeFiles/scylla_version.dir/Debug/release.cc.o CMakeFiles/scylla.dir/Debug/main.cc.o -o Debug/scylla -L/home/kefu/dev/scylladb/idl/absl::headers -Wl,-rpath,/home/kefu/dev/scylladb/idl/absl::headers:/home/kefu/dev/scylladb/build/Debug/seastar  Debug/libscylla-main.a  api/Debug/libapi.a  alternator/Debug/libalternator.a  db/Debug/libdb.a  cdc/Debug/libcdc.a  compaction/Debug/libcompaction.a  cql3/Debug/libcql3.a  data_dictionary/Debug/libdata_dictionary.a  gms/Debug/libgms.a  index/Debug/libindex.a  lang/Debug/liblang.a  message/Debug/libmessage.a  mutation/Debug/libmutation.a  mutation_writer/Debug/libmutation_writer.a  raft/Debug/libraft.a  readers/Debug/libreaders.a  redis/Debug/libredis.a  repair/Debug/librepair.a  replica/Debug/libreplica.a  schema/Debug/libschema.a  service/Debug/libservice.a  sstables/Debug/libsstables.a  streaming/Debug/libstreaming.a  test/perf/Debug/libtest-perf.a  tools/Debug/libtools.a  transport/Debug/libtransport.a  types/Debug/libtypes.a  utils/Debug/libutils.a  Debug/seastar/libseastar.so  /usr/lib64/libyaml-cpp.so  /usr/lib64/libboost_program_options.so.1.83.0  test/lib/Debug/libtest-lib.a  -Xlinker --push-state -Xlinker --whole-archive  auth/Debug/libscylla_auth.a  -Xlinker --pop-state  /usr/lib64/libcrypt.so  cdc/Debug/libcdc.a  compaction/Debug/libcompaction.a  mutation_writer/Debug/libmutation_writer.a  -Xlinker --push-state -Xlinker --whole-archive  dht/Debug/libscylla_dht.a  -Xlinker --pop-state  index/Debug/libindex.a  -Xlinker --push-state -Xlinker --whole-archive  locator/Debug/libscylla_locator.a  -Xlinker --pop-state  message/Debug/libmessage.a  gms/Debug/libgms.a  sstables/Debug/libsstables.a  readers/Debug/libreaders.a  schema/Debug/libschema.a  -Xlinker --push-state -Xlinker --whole-archive  tracing/Debug/libscylla_tracing.a  -Xlinker --pop-state  Debug/libscylla-main.a  -Xlinker --push-state -Xlinker --whole-archive  Debug/libscylla-zstd.a  -Xlinker --pop-state  /usr/lib64/libzstd.so  abseil/absl/strings/Debug/libabsl_cord.a  abseil/absl/strings/Debug/libabsl_cordz_info.a  abseil/absl/strings/Debug/libabsl_cord_internal.a  abseil/absl/strings/Debug/libabsl_cordz_functions.a  abseil/absl/strings/Debug/libabsl_cordz_handle.a  abseil/absl/crc/Debug/libabsl_crc_cord_state.a  abseil/absl/crc/Debug/libabsl_crc32c.a  abseil/absl/crc/Debug/libabsl_crc_internal.a  abseil/absl/crc/Debug/libabsl_crc_cpu_detect.a  abseil/absl/strings/Debug/libabsl_str_format_internal.a  service/Debug/libservice.a  node_ops/Debug/libnode_ops.a  service/Debug/libservice.a  node_ops/Debug/libnode_ops.a  raft/Debug/libraft.a  repair/Debug/librepair.a  streaming/Debug/libstreaming.a  replica/Debug/libreplica.a  abseil/absl/container/Debug/libabsl_raw_hash_set.a  abseil/absl/hash/Debug/libabsl_hash.a  abseil/absl/hash/Debug/libabsl_city.a  abseil/absl/types/Debug/libabsl_bad_variant_access.a  abseil/absl/hash/Debug/libabsl_low_level_hash.a  abseil/absl/types/Debug/libabsl_bad_optional_access.a  abseil/absl/container/Debug/libabsl_hashtablez_sampler.a  abseil/absl/profiling/Debug/libabsl_exponential_biased.a  abseil/absl/synchronization/Debug/libabsl_synchronization.a  abseil/absl/debugging/Debug/libabsl_stacktrace.a  abseil/absl/synchronization/Debug/libabsl_graphcycles_internal.a  abseil/absl/synchronization/Debug/libabsl_kernel_timeout_internal.a  abseil/absl/debugging/Debug/libabsl_symbolize.a  abseil/absl/debugging/Debug/libabsl_debugging_internal.a  abseil/absl/base/Debug/libabsl_malloc_internal.a  abseil/absl/debugging/Debug/libabsl_demangle_internal.a  abseil/absl/time/Debug/libabsl_time.a  abseil/absl/strings/Debug/libabsl_strings.a  abseil/absl/strings/Debug/libabsl_strings_internal.a  abseil/absl/strings/Debug/libabsl_string_view.a  abseil/absl/base/Debug/libabsl_throw_delegate.a  abseil/absl/numeric/Debug/libabsl_int128.a  abseil/absl/base/Debug/libabsl_base.a  abseil/absl/base/Debug/libabsl_raw_logging_internal.a  abseil/absl/base/Debug/libabsl_log_severity.a  abseil/absl/base/Debug/libabsl_spinlock_wait.a  -lrt  abseil/absl/time/Debug/libabsl_civil_time.a  abseil/absl/time/Debug/libabsl_time_zone.a  -lsystemd  /usr/lib64/libz.so  /usr/lib64/libdeflate.so  types/Debug/libtypes.a  utils/Debug/libutils.a  /usr/lib64/libyaml-cpp.so  /usr/lib64/libcryptopp.so  /usr/lib64/libboost_regex.so.1.83.0  /usr/lib64/libicui18n.so  /usr/lib64/libicuuc.so  -ldl  /usr/lib64/libboost_unit_test_framework.so.1.83.0  Debug/seastar/libseastar_perf_testing.so  /usr/lib64/libjsoncpp.so.1.9.5  db/Debug/libdb.a  data_dictionary/Debug/libdata_dictionary.a  cql3/Debug/libcql3.a  transport/Debug/libtransport.a  cql3/Debug/libcql3.a  transport/Debug/libtransport.a  lang/Debug/liblang.a  /usr/lib64/liblua-5.4.so  -lm  rust/Debug/libwasmtime_bindings.a  rust/librust_combined.a  /usr/lib64/libsnappy.so.1.2.1  mutation/Debug/libmutation.a  Debug/seastar/libseastar.so  /usr/lib64/liblz4.so  /usr/lib64/libxxhash.so && :
ld.lld: error: undefined symbol: gnutls_hmac_fast
>>> referenced by aws_sigv4.cc:21 (/home/kefu/dev/scylladb/utils/aws_sigv4.cc:21)
>>>               aws_sigv4.cc.o:(utils::aws::hmac_sha256(std::basic_string_view<char, std::char_traits<char>>, std::basic_string_view<char, std::char_traits<char>>)) in archive utils/Debug/libutils.a

ld.lld: error: undefined symbol: gnutls_strerror
>>> referenced by aws_sigv4.cc:23 (/home/kefu/dev/scylladb/utils/aws_sigv4.cc:23)
>>>               aws_sigv4.cc.o:(utils::aws::hmac_sha256(std::basic_string_view<char, std::char_traits<char>>, std::basic_string_view<char, std::char_traits<char>>)) in archive utils/Debug/libutils.a
```

in this change, we detect this library, and link its caller against it.
this addresses the link failure.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
fc8212483e build: cmake: detect and link against yaml-cpp
in main.cc, we use yaml-cpp library directly. so we are obliged to
detect this library in scylla and link against it instead of relying
on other library to do this. currently, Seastar detects it and pulls
in yaml-cpp for us, but we should not take this for granted and rely
on this.

in this change, we detect and link against yaml-cpp to make this
dependency explicit.

the same applies to the "utils" library.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
2e4be56112 build: cmake: link Seastar with Seastar::<COMPONENT>
before this change, we link against the targets defined in Seastar's
source tree. but these targets are not part of Seastar's public
interface -- they are not exposed by Seastar's CMake config files.

so, let link against the target names qualified by the library module
name. this also prepares for the transition to using Seastar without
including it directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Kefu Chai
b2dc261841 build: cmake: define CMake generate helper funcs in scylla
before this change, we assume that scylla's CMake script includes
Seastar's CMake script.

but we are going to consume Seastar using its .pc files or its CMake
config files instead of including it directly. more over these helper
functions are not part of Seastar's public interface.

actually the same applies to the `check_headers()` helper, which was
adapted from seastar's CheckHeaders.cmake.

so to be prepared for this change, let's define these generate helper
functions in scylla.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-18 08:36:52 +08:00
Avi Kivity
f4acaa5473 cql3: index_target: forward declare boost::regex
No need to burden everyone with the full boost::regex code.

Closes scylladb/scylladb#21148
2024-10-17 19:14:40 +02:00
Botond Dénes
e1d8cddd09 test/boost/mutation_test: add test for multishard permit safety
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.
2024-10-17 08:47:50 -04:00
Botond Dénes
5a3fd69374 test/lib/reader_lifecycle_policy: add semaphore factory to constructor
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
2024-10-17 08:47:50 -04:00
Botond Dénes
c8598e21e8 test/lib/reader_lifecycle_policy: rename factory_function
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
2024-10-17 08:47:50 -04:00
Botond Dénes
76a5ba2342 repair/row_level: drop now unneeded release_base_resource() calls
The multishard reader now does this itself, no need to do it here.
2024-10-17 08:47:50 -04:00
Botond Dénes
218ea449a5 readers/multishard: make multishard reader safe to create with admitted permits
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.

Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader), forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design-flaw of the multishard reader, which is addressed in
this patch, after which, it is safe to pass admitted or not admitted
permits to the multishard reader, it will handle the call to
`release_base_resources()` if needed.
2024-10-17 08:45:21 -04:00
Raphael S. Carvalho
f3ab5e1f1e tests: Fix perf test for load balancer
Broken after introduction of zero-token nodes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21156
2024-10-17 14:02:31 +02:00
Kamil Braun
f02afefd34 Merge 'raft: consider the gossiper state then sending the group0 state id' from Emil Maskovsky
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).

Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.

The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.

Fixes: scylladb/scylladb#21117

No backport: Fixes an issue that is currently only present in master

Closes scylladb/scylladb#21119

* github.com:scylladb/scylladb:
  raft: consider the gossiper state then sending the group0 state id
  raft: add the test for GROUP0_STATE_ID gossip application state
2024-10-17 13:41:15 +03:00
Kefu Chai
5ef0cbb693 tools/scylla-nodetool: s/vm.count()/vm.contains()/
this change is created in the same spirit of 0104c7d3, which used
`std::map::contains()` in the place of `std::map::count()` when
checking for the existence of a paramter with given name for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21158
2024-10-17 13:41:15 +03:00
Alexey Novikov
b965729f0a replica: implement memtable_flush_period_in_ms schema option
implement cassandra original schema option memtable_flush_period_in_ms:
Milliseconds before memtables associated with the table are flushed.

there are few things concerning this patch:
* milliseconds look strange and scary for this option. Unlike Cassandra
  we use 60000ms (1min) minimum value for this option.
* This is limitation of Cassandra but it is impossible to set this option
  for system tables. However sometimes it could be very useful to use
  automatic flushing for such a tables: some system tables have small
  traffic and as a result prevent tombstone garbage collection.

Fixes #20270

Closes scylladb/scylladb#20999
2024-10-17 13:41:15 +03:00
Anna Stuchlik
b54ce3b0c0 doc: remove the redundant raw:: html directive
This commit removes the raw:: html directive (with the exception of an embedded animation) because:
- It is not supported by the dark theme and looks bad.
- It's a legacy directive, and we no longer need it on index pages.

Fixes https://github.com/scylladb/scylladb/issues/20881

Closes scylladb/scylladb#21062
2024-10-17 13:41:15 +03:00
Kefu Chai
d7f315ef63 tool/scylla-nodetool: check for positional argument passed to "restore"
before this change, if no positional arguments are passed to "restore"
subcommand, the tool fails with following error message:

```
error running operation: boost::wrapexcept<boost::bad_any_cast> (boost::bad_any_cast: failed conversion using boost::any_cast)
```

this is difficult to digest.

after this change, if no sstables are specified:

```
error processing arguments: missing required parameter: sstables
```

this is slightly better from user experience's perspective.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21136
2024-10-17 13:41:15 +03:00
Kefu Chai
9355a32b5c utils/loading_cache: s/typeof/decltype/
`typeof` is a GNU extension, and is part of C23, but it is not included
by C++23.

if we compile the tree with c++23 instead of gnu++23, the compilation
fails like:

```
FAILED: repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o
/home/kefu/.local/bin/clang++ -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -std=c++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DFMT_SHARED -DWITH_GZFILEOP -MD -MT repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o -MF repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o.d -o repair/CMakeFiles/repair.dir/RelWithDebInfo/repair.cc.o -c /home/kefu/dev/scylladb/repair/repair.cc
In file included from /home/kefu/dev/scylladb/repair/repair.cc:21:
In file included from /home/kefu/dev/scylladb/service/storage_service.hh:19:
In file included from /home/kefu/dev/scylladb/service/qos/service_level_controller.hh:19:
In file included from /home/kefu/dev/scylladb/auth/service.hh:23:
In file included from /home/kefu/dev/scylladb/auth/permissions_cache.hh:22:
/home/kefu/dev/scylladb/utils/loading_cache.hh:754:66: error: use of undeclared identifier 'typeof'; did you mean 'typeid'?
  754 |         static_assert(SectionHitThreshold <= std::numeric_limits<typeof(_touch_count)>::max() / 2, "SectionHitThreshold value is too big");
      |                                                                  ^
/home/kefu/dev/scylladb/utils/loading_cache.hh:754:66: error: template argument for template type parameter must be a type
  754 |         static_assert(SectionHitThreshold <= std::numeric_limits<typeof(_touch_count)>::max() / 2, "SectionHitThreshold value is too big");
      |                                                                  ^~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/limits:311:21: note: template parameter is declared here
  311 |   template<typename _Tp>
      |                     ^
2 errors generated.
```

in this change, we trade `typeof` for a more standard compliant
`decltype`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21116
2024-10-17 13:41:15 +03:00
Pavel Emelyanov
df83fe2dae Merge 'interval: replace boost ranges with std ranges' from Avi Kivity
To reduce dependency load, replace use of boost ranges with std ranges.

Since std ranges are more particular about what iterators they accept, a custom iterator
in size_estimates_virtual_reader has to be fixed first.

No backport; code cleanup.

Closes scylladb/scylladb#21143

* github.com:scylladb/scylladb:
  interval: change boost ranges to std ranges
  size_estimates_virtual_reader: make virtual_row_iterator more conforming
2024-10-17 13:41:15 +03:00
Avi Kivity
6fd219d982 sstables: generation_type: deinline from_string()
This is not performance sensitive and penalizes everyone by including
boost/regex.hpp. Fix by deinlining.

Closes scylladb/scylladb#21147
2024-10-17 13:41:15 +03:00
Emil Maskovsky
e082fef32c raft: remove the group0 state id handler stop check
The stop assertion check in the group0 state id handler was triggering
under some circumstances (stopping server during restart). In that case
it might be that the stop is initiated before the server is fully
initialized, and then the handler destructor is being called without
calling to the `stop()` method first. This is a valid scenario.

The whole `stop()` in the group0 state id handler is not necessary,
as the only operation being done is cancelling the timer which is done
by the timer destructor automatically anyway.

There is the concern of a currently running timer callback, but it
doesn't preempt (not async) so the timer shouldn't be destroyed before
the callback finishes.

Fixes: scylladb/scylladb#21074

Closes scylladb/scylladb#21127
2024-10-17 13:41:15 +03:00
Emil Maskovsky
3f1af268c2 raft: consider the gossiper state then sending the group0 state id
Skip the advertisement of the group0 state id in case the gossiper is
not active (ready).

Sending the application state when the gossiper is not active caused
a warning being shown in the log about the local endpoint not being
found in the gossiper endpoint state map on a (graceful) node restart.

The local endpoint is initialized on the gossiper startup, so we skip
the state id advertisement until the startup is finished.

Fixes: scylladb/scylladb#21117
2024-10-16 19:26:25 +02:00
Emil Maskovsky
65d3d4fd93 raft: add the test for GROUP0_STATE_ID gossip application state
Test that the GROUP0_STATE_ID gossip application state is not causing
the "endpoint_state_map does not contain endpoint" error.

Refs: scylladb/scylladb#21117
2024-10-16 19:21:14 +02:00
Calle Wilund
f2ef75c3da commitlog_test: Up timeout for large entry tests
Fixes #21150

Apparently, on some CI, in debug, these tests can time out (large alloc)
without actually failing what they do. Up the timeout (could consider removing
as well, but...) so they hopefully pass.

Closes scylladb/scylladb#21151
2024-10-16 18:13:04 +03:00
Avi Kivity
f799234c82 Update tools/java submodule (deprecation notice)
* tools/java b2d025fd6b...807e991de7 (1):
  > README.md: add deprecation notice for java tools
2024-10-16 17:09:48 +03:00
Avi Kivity
b73f0197a8 Merge 'micro-updates to documentation development, on python-poetry' from Laszlo Ersek
- `docs/Makefile`: work around python-poetry issue https://github.com/python-poetry/poetry/issues/8761
- `docs/README.md`: fix minimum poetry version

No backporting needed (docs development).

Closes scylladb/scylladb#21118

* github.com:scylladb/scylladb:
  docs/README.md: fix minimum poetry version
  docs/Makefile: work around python-poetry issue #8761
2024-10-16 14:16:29 +03:00
Nadav Har'El
ee0e7a7adf mv: test that operations that should not be allowed on a view, aren't
This patch adds test/cql-pytest tests which verify that all CQL operations
that shouldn't be allowed on a materialized view, actually aren't:

* All operations writing to a table - INSERT, UPDATE, BATCH, DELETE,
  and TRUNCATE - should be rejected when asked to operate on a view.

* All operations with "TABLE" in their name (DROP TABLE, ALTER TABLE,
  DESC TABLE) should be rejected on a view - the ".. MATERIALIZED VIEW"
  operation should be used instead.

* A materialized view cannot get materialized views or indexes of its
  own.

All tests pass on Cassandra (Cassandra 4 or above is needed for the
"DESC" test), and all but one pass on Scylla - Scylla does allow
"DESC TABLE" on a materialized view, unlike Cassandra. I opened an
issue to track that difference: Refs #21026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21028
2024-10-16 13:43:36 +03:00
Avi Kivity
d58cd262ca interval: change boost ranges to std ranges
Reduce dependency load.

size_estimates_virtual_reader is adjusted due to poor boost ranges
and std ranges interoperability.
2024-10-16 13:21:43 +03:00
Avi Kivity
3a75efd6d4 size_estimates_virtual_reader: make virtual_row_iterator more conforming
To work with std::ranges, an iterator has to have a default constructor,
and be assignable. Add the default constructor and convert references
to pointers to support this.
2024-10-16 13:21:25 +03:00
Pavel Emelyanov
4a8ab9b3bc s3/client: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-16 12:27:29 +03:00
Pavel Emelyanov
a15dfe0154 s3/client: Catch do_upload_file::upload_part() exceptions
This method spawns part uploading in the background, but still may
throw, e.g. preparing http request or claiming memory. In this case any
outstanding part upload fibers are not waited on, and the whole
do_upload_file object can be freed from under their feet. Also, the
multipart upload is not aborted, thus losing track of it until g.c.
happens.

To fix it, catch any exception from upload_part() too, and if it
happens, do what the regular upload_sink would do -- close the gate thus
picking up any outstanding activity that may happen there and abort the
multipart upload.

Indentation is deliberately left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-16 12:23:31 +03:00
Nadav Har'El
210d53070e docs/alternator: explain service discovery HTTP requests
In this patch we add to docs/new-apis.md (Alternator-specific API)
a description of the service discovery HTTP requests - `/` and
`/localnodes` that was previously not documented except in a design
document that is unfortunately no longer available publically.

The description also includes the recently added `dc` and `rack`
parameters for the `/localnodes` request.

Fixes #20989

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-10-16 10:15:04 +03:00
Nadav Har'El
367e18ed4a docs/alternator: split Alternator-specific APIs from alternator.md
Before this patch, the documentation of Alternator-specific APIs (APIs
which are unique to Alternator and don't exist in DynamoDB) appear as
a section of the main document alternator.md. In the next patch we
want to describe yet another Alternator feature and make this section
even longer. But there is growing sentiment that the Alternator
documentation should be split into more, shorter, pages (Refs #19822)
so this patch splits the Alternator-specific API documentation into a
new file, new-apis.md.

There is no new content in the patch - just movement of existing content
plus a reference to the new page.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-10-16 10:14:31 +03:00
Kefu Chai
32f508d450 raft: fix typo in logging message
s/miminum/minimum/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21073
2024-10-16 06:33:43 +03:00
Avi Kivity
d59038fa93 storage_proxy: convert boost range algorithms to std::ranges
Standardize on a single range library.

The changes are mostly mechanical. The only exception is boost::join,
which has no analog in std::ranges (rightly so, since it cannot be
implemented efficiently). A variety of tricks were used to convert it:

 - use std::ranges::join() on an std::array of std::span (when the
   inputs were all contiguous)
 - copy to a utils::small_vector (when it is expected that there will
   be no allocation)
 - use a small_vector of pointers and iterate+dereference that

Closes scylladb/scylladb#21082
2024-10-15 16:52:27 +02:00
Avi Kivity
820509026f schema: replace boost ranges with std ranges
To reduce dependency load, use std ranges instead of boost ranges.

The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.

Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
2024-10-15 16:42:54 +03:00
Piotr Dulikowski
a380a2efd9 test/test_view_build_status: properly wait for v2 in migration test
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.

However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.

In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.

Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.

Fixes: scylladb/scylladb#21060

Closes scylladb/scylladb#21091
2024-10-15 14:57:47 +03:00
Pavel Emelyanov
63725b10a8 Merge 'cql: create default superuser if it doesn't exist' from Paweł Zakrzewski
This change reorganizes the way standard_role_manager startup is handled: role_manager::ensure_superuser_is_created() is added, which returns a future that resolves once the superuser is available. We wait for this future before starting the CQL server.

There is a change in behavior auth::do_after_system_ready is potentially an infinite loop, and we await its result.

Fixes #10481

Reason for no backports: it's not a regresson and it's an issue that may only affect a tiny time window during the cluster startup.

Closes scylladb/scylladb#20137

* github.com:scylladb/scylladb:
  test: test_restart_cluster: create the test
  auth: standard_role_manager allows awaiting superuser creation
  auth: coroutinize the standard_role_manager start() function
  auth: don't start server until the superuser is created
2024-10-15 14:56:04 +03:00
Avi Kivity
a5c37a110f schema: precompute all_columns_in_select_order()
all_columns_in_select_order() returns a complicated boost range type
that has no analog in std::ranges. To ease the transition to std::ranges,
precompute most of the work done in that function, and only convert
pointers to references in the function itself.

Since boost ranges and std::ranges don't fully interoperate, one of
the user has to be adjusted.
2024-10-15 14:04:12 +03:00
Pavel Emelyanov
6e0899c2b4 data_dictionary: Replace boost ranges with std ranges
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21105
2024-10-15 13:22:08 +03:00
Laszlo Ersek
a0ffbd5bcf docs/README.md: fix minimum poetry version
Commit 2a3012db7f ("docs/README.md: expand prerequisites list",
2022-08-31) referenced poetry release 1.12, which does not exist even
today (as of this writing, the latest release is 1.8.4). The intent was
probably 1.1.12.

Copy the minimum version from "sphinx-scylladb-theme": 1.8.1 (see
"docs/source/getting-started/installation.rst" and
"docs/source/getting-started/quickstart.rst" at commit f7c26b422572).

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-10-15 12:19:21 +02:00
Laszlo Ersek
e5c2d4bd1d docs/Makefile: work around python-poetry issue #8761
Python-poetry is affected by bug
<https://github.com/python-poetry/poetry/issues/8761>. Namely, if you have
"keyring" <https://pypi.org/project/keyring/> installed, poetry will try
to gain access to the Default collection in the (ex. GNOME) keyring, even
if poetry only needs read-only access to package repositories, and even if
those repos are public.

Consequently, you either unlock your Default collection for poetry
(unjustifiedly), or your GUI session gets effectively locked up, because
any time you hit Cancel on the keyring unlock dialog, poetry immediately
pops up another, and this dialog grabs the keyboard -- you cannot even
switch to a character VT, for killing poetry; you have to log in via ssh
for that.

This issue is not visible to users who don't use "keyring" (GNOME or
otherwise). For those who do, work around the problem by selecting the
"null" keyring back-end, in the environment of every poetry invocation.

Note: I have not regression-tested the workaround in a desktop environment
where "keyring" is unavailable to begin with.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-10-15 12:07:00 +02:00
Botond Dénes
f93abebbb9 Merge 'Sanitize compaction manager API endpoints' from Pavel Emelyanov
Endpoints are registered next to the service they use, and the unregistration deferred action is created right after it. When registered, the service in question is passed as argument and then captured by enpoints lambdas. This makes sure that service is not used by endpoints after being stopped.

That's not quite the case for compaction manager. Its endpoints can be registered in several places, and compaction_manager "function" is not unregistered on stop. This patch fixes some of this misbehavior, in particular:
- adds unregistration of compaction_manager API function
- uses sharded<compaction_manager>& argument in endpoints instead of ctx.db.local().get_compaction_manager() chain
- moves some endpoints from storage_service.cc to compaction_manager.cc

Closes scylladb/scylladb#20962

* github.com:scylladb/scylladb:
  api: Use captured compaction_manager in get_cm_stats() helper
  api: Use captured compaction_manager in endpoints
  api: Add sharded<compaction_manager> argument to compaction_manager API reg/unreg
  api: Move some endpoints from storage_service.cc to compaction_manager.cc
  api: Unset compaction_manager endpoints
  api: Use shorter registration method for compaction_manager function
2024-10-15 10:20:58 +03:00
Daniel Reis
28a265ccd8 docs: fix redirect from cert-based auth to security/enable-auth page
Closes scylladb/scylladb#19943
2024-10-15 09:29:05 +03:00
Kefu Chai
a82706eb8f Update seastar submodule
* seastar 3c9c2696...abd20efd (44):
  > Revert "build: enable Seastar to build shared and static libs in a single build"
  > dns: Support c-ares before 1.22
  > build: improve c-ares version extraction method
  > Minor typos fix in doc: reference_wrapper.hh
  > build: enable Seastar to build shared and static libs in a single build
  > build: include -fno-semantic-interposition in CXXFLAGS
  > loop: add Sentinel iterator support to parallel_for_each()
  > dns: use ARES_LIB_INIT_NONE instead of a magic number
  > dns: use struct typedef for `_channel`
  > doc/testing.md: explain seastar + boost test colocation
  > build: extract c-ares version from header file
  > dns: replace deprecated ares_process() with ares_process_fd()
  > build: do not support c-ares >= 1.33
  > http: fix indentation
  > http: Add non-owning `make_request` to http client
  > treewide: replace boost::irange with std::views::iota where possible
  > Added unit test for "http_content_length_data_sink_impl"
  > sharded.hh: migrate to concepts
  > file, scheduling: remove non-unified I/O and CPU scheduling
  > http: Add more HTTP response codes
  > http: add constness to `response_line`
  > http: refactor response_line to use `seastar::format`
  > httpd/file_handler: Always close stream
  > build: add compiler and C++ standard compatibility checks
  > rpc: rpc_types: replace boost::any with std::any
  > tls: drop dependency on boost::any
  > rpc: drop unnecessaty includes to boost libraries
  > rpc: compressor factory: deinline some boost-using functions
  > sharded: replace boost ranges with <ranges>
  > scheduling_specific: drop dependency on boost range adaptors
  > prefetch: drop dependency on boost::mpl
  > resource: drop unused dependency on boost::any
  > smp: drop dependency on boost ranges
  > reactor: remove unnecessary boost includes
  > execution_stage: remove unnecessary boost includes
  > sharded.hh: add invoke_on variant for a shard range
  > shared_ptr: remove deprecated lw_shared_ptr assignment operator
  > seastar-addr2line: add --debug arg
  > addr2line: add type checking
  > warnings: fix unused result warnings
  > thread_pool: fix includes
  > signal: remove trailing spaces
  > tests/unit: chmod -x signal_test.cc
  > iostream/http: Fix output_stream::write(temporary_buffer) overload

Closes scylladb/scylladb#21109
2024-10-15 09:09:29 +03:00
Tomasz Grabiec
3e438d23e1 Merge 'Check system.tablets update before putting it into the table' from Pavel Emelyanov
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

fixes #20043

Closes scylladb/scylladb#21020

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
2024-10-15 00:38:59 +02:00
Piotr Smaron
3969ffb39f test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: scylladb/scylladb#21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

Closes scylladb/scylladb#21056
2024-10-14 16:18:44 +02:00
Avi Kivity
c286ddab38 test: lib: rest_client: use 'http' scheme even when connecting via a unix socket
aiohttp 3.10.5 complains when 'unix+http' is used for a unix-domain
socket. USe 'http', which work with 3.10.5 and the toolchain's 3.9.5.

Closes scylladb/scylladb#21080
2024-10-14 15:32:56 +02:00
Piotr Dulikowski
48d75818fd SCYLLA-VERSION-GEN: correct the logic for skipping SCYLLA-*-FILE
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.

If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.

However, the check assumes that the RELEASE is in the format

  <build identifier>.<date>.<commit hash>

and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.

Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.

Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.

Fixes: scylladb/scylladb#21027

Closes scylladb/scylladb#21049
2024-10-14 13:49:15 +03:00
Calle Wilund
8eaf00ff11 test::topology: Add test for TLS upgrade and downgrade of internode encryption
Test a rolling upgrade of cluster while active.
Note: This is a unit test version of dtest test. Has the big drawback of not
being able to use cassandra-stress to work and verify the cluster and results

Test moves from none to all to none encryption while writing and then checking
written data.
2024-10-13 23:54:06 +00:00
Calle Wilund
a557f699a2 docs: Add internode_encryption=transitional documentation
Describing upgrading cluster(s) without downtime.
2024-10-13 23:54:06 +00:00
Calle Wilund
390b9759b6 messaging_service: Add "transitional" internode encryptipn mode
Fixes #18903

Adds a "transitional" internode encryption mode, under which all
_outgoing_ RPC connections will use TLS, but we will still accept
any incoming non-tls connection.

This allows an operator to perform a move to TLS RPC without cluster
downtime:

1. For each server, add certificate etc options to
   server_encryption_options
   + internode_encryption=none
   + set ssl_storage_port
   + restart (rolling)

2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR
2024-10-13 23:54:06 +00:00
Calle Wilund
503a71f9b8 messaging_service: Create TLS connector even if internode_enc=none when certs set
Refs #18903

If ssl_storage_port is non-zero _and_ we have specified actual certificates
are set/exists, create TLS connector for RPC regardless of whether internode
encryption is enables. I.e. potentially unused. For transitioning cluster to
TLS.
2024-10-13 23:54:05 +00:00
Avi Kivity
db14a01901 Merge 'Use table id as system.sstables partition key' from Pavel Emelyanov
The system.sstables (a.k.a. sstables registry) primary key is "string location" as partition key and "uuid generation" as clustering one. The "location" part was taken from table.config.datadir value which, in turn, a string containing path to on-disk files if the table was located locally, e.g. /var/lib/scylla/data/ks/cf-abc123 one. Recently [1] the datadir was moved from table config onto storage options, but this string is still used as registry key.

Other than being owned by a table with ID, sstables are accessed by restore-from-object-storage code [2]. To make it work, both storage driver and sstable_directory helper class maintain two formats of object prefixes for sstables components. For S3-backed sstables having a record in registry, the path used is s3://bucket/generation/component. For restore code there are user-provided prefixes that do not match the aforementioned pattern. The selection between those two is now made by checking sstable state, which is not obvious and may cause troubles for tiered storage driver.

This patch changes  the registry schema so that partition key becomes "uuid owner" and is set to be table.id() value. This is to stop using the local path by S3 backed sstables. Also this change makes it possible for storage driver and sstable directory to rely on the storage options only to tell different bucket prefixes formats from each other.

As a side effect, the make_s3_object_name() helper, that generates the proper object name, becomes explicit for restore-from-S3 usage. Now it relies on the sstable::filename() calling this->prefix() behind the scenes and the latter to return the user-provided prefix, which is pretty fragile construction.

No need to backport (and it's not going to be easy to do it), storage options feature is still experimental

Refs #20675 [1]
Refs #20305 [2]

Closes scylladb/scylladb#20998

* github.com:scylladb/scylladb:
  sstables: Flatten S3 object name making
  sstable_directory: Flatten directory lister creation
  treewide: Rename sstable registry location field to be owner
  system_keyspace: Change sstables registry partition key type
  sstables: Keep location variant on s3 backend too
  storage_options: Use variant on S3 options
  sstables: Split sstable::filename() helper
  sstables: Add s3_storage::owner() helper
2024-10-13 20:08:43 +03:00
Kefu Chai
7d2d44883b install.sh: install seastar/scripts/addr2line.py as well
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
    from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```

in this change, we redistribute `addr2line.py` as well. this
should address the issue above.

Fixes scylladb/scylladb#21077

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21078
2024-10-13 19:35:14 +03:00
Kefu Chai
519b4a2934 utils/s3: include used header
when building the tree with clang-19 and libstdc++ shipped along with
GCC 14.2.1, we have

```
clang++ -MD -MT build/release/utils/s3/aws_error.o -MF build/release/utils/s3/aws_error.o.d -std=c++23 -I/home/kefu/dev/scylladb/master/seastar/include -I/home/kefu/dev/scylladb/master/build/release/seastar/gen/include -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_LOGGER_TYPE_STDOUT -DFMT_SHARED -I/usr/include/p11-kit-1 -DWITH_GZFILEOP -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -ffunction-sections -fdata-sections  -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -Xclang -fexperimental-assignment-tracking=disabled -iquote. -iquote build/release/gen -std=gnu++23  -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -DBOOST_ALL_DYN_LINK    -fvisibility=hidden -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-psabi -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN  -c -o build/release/utils/s3/aws_error.o utils/s3/aws_error.cc
utils/s3/aws_error.cc:33:21: error: no member named 'make_unique' in namespace 'std'
   33 |     auto doc = std::make_unique<rapidxml::xml_document<>>();
      |                ~~~~~^
utils/s3/aws_error.cc:33:57: error: expected '(' for function-style cast or type construction
   33 |     auto doc = std::make_unique<rapidxml::xml_document<>>();
      |                                 ~~~~~~~~~~~~~~~~~~~~~~~~^
utils/s3/aws_error.cc:33:59: error: expected expression
   33 |     auto doc = std::make_unique<rapidxml::xml_document<>>();
      |                                                           ^
3 errors generated.
ninja: build stopped: subcommand failed.
```

in order to address the build failure, let's include the used header.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21064
2024-10-13 18:32:34 +03:00
Patryk Jędrzejczak
18d3a6480d test: test_read_required_hosts: run with the raft-based topology
When we made the raft-based topology mandatory, all boost test
tests started using it. Then, `test_read_required_hosts` started
failing. We left investigating it for later and started running it
with `force-gossip-topology-changes` to make it pass.

Currently, the test doesn't fail with the raft-based topology
anymore. Hence, we remove the FIXME and run the test with a normal
config.

We don't know when and why the test stopped failing. Investigating
it wouldn't be easy, since we don't even know why it failed in the
first place. We suspect that there was some bug that is now fixed.

This patch only fixes a test, there is no need to backport it.

Fixes scylladb/scylladb#18463

Closes scylladb/scylladb#20960
2024-10-11 17:01:20 +02:00
Kamil Braun
96070bb5b3 Merge 'storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from Sergey Zolotukhin
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in  filter_for_query(): the map is considered incorrect if the list  of replicas contains a node from a data center whose replication factor is 0.

 Please note: This PR does not fix the issue found in scylladb/scylladb#20282;   it only adds condition checks to prevent undefined behavior in cases of  inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

Closes scylladb/scylladb#20851

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-11 15:02:02 +02:00
Paweł Zakrzewski
900a6706b8 test: test_restart_cluster: create the test
The purpose of this test that the cluster is able to boot up again after
a full cluster shutdown, thus exhibiting no issues when connecting to
raft group 0 that is larger than one.
2024-10-11 13:25:07 +02:00
Paweł Zakrzewski
7008b71acc auth: standard_role_manager allows awaiting superuser creation
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.

Fixes #10481
2024-10-11 13:25:07 +02:00
Paweł Zakrzewski
04fc82620b auth: coroutinize the standard_role_manager start() function
This change is a preparation for the next change. Moving to coroutines
makes the code more readable and easier to process.
2024-10-11 13:25:07 +02:00
Paweł Zakrzewski
f525d4b0c1 auth: don't start server until the superuser is created
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.

Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.

This suggests that we don't really need to wait during restart.

This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.
2024-10-11 13:25:07 +02:00
Pavel Emelyanov
a7042d66e3 sstables: Flatten S3 object name making
The s3_storage backend driver has a method that generates object path
within the bucket. Depending on options alternative it picks one of two
formats:

- for string prefix, it uses it implicitly via sstable::filename() call
  that calls storage->prefix() which, in turn, returns prefix value

- for registry-backed sstables, the /bucket/generation/component path is
  generated

This patch bruses this place up. Similarly to previous patch, this
change also makes the selection based on the location alternative, not
on the sstable state. As well it's idempotent change, as S3 sstables
with 'upload' state only appear when restoring from object store, and in
this case the string location is in use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
8d5537a439 sstable_directory: Flatten directory lister creation
After previous patchin, the way components lister is created for S3
storage options became quite hairy. This patch brushes things up to be
easier to read.

The only "functional" change here, is that selection between registry
lister and S3 lister is made based on options' location held
alternative, not on the sstable state value. That's in fact idempotent
change, the only caller that provides string location on options is the
"restore from object store" code that also sets state to be 'upload'.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
031893259a treewide: Rename sstable registry location field to be owner
This is sort of continuation of the previous patch. The partition key in
the registry is now table_id, not string, and is better called "owner",
not "location". This patch is s/location/owner/ over specific places
that include field name in the schema, argument names in registry
maintenance classes and tests accessing the selected row fields by name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 14:11:28 +03:00
Pavel Emelyanov
3315e3a2a9 system_keyspace: Change sstables registry partition key type
Today, the system.sstables schema uses string as partition key. Callers,
in turn, use table's datadir value to reference entries in it. That's
wrong, S3-backed sstables don't have any local paths to work with. The
table's ID is better in this role.

This patch only changes the field type to be table_id and fixes the
callers to provide one. In particular, see init_table_storage() change
-- instead of generating a datadir string, it sets table.id() as the
options' location. Other fixed places are tests. Internally, this id
value is propagated via s3_storage::owner() method, that's fixed as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:48:09 +03:00
pehala
a2f9136e36 test.py: Use "python -m pytest" for pytest invocation for PythonTest
Enables debugging inside pytest subprocesses as well. It seems that pydev automatically attaches itself also to all python subprocesses. Since we used to call "pytest" wrapper it was deemed a different program, and we could not debug individual tests.

Closes scylladb/scylladb#21050
2024-10-11 13:38:47 +03:00
Pavel Emelyanov
bb13b7bf72 sstables: Keep location variant on s3 backend too
Previous patch put variant<string, table_id> as location of S3 options.
This patch makes the S3 sstables backend driver keep variant as sstable
location. As with the previous patch, driver only keeps variant, but
continues using its string alternative internally. This will be changed
later on.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:09:47 +03:00
Pavel Emelyanov
1181b6b082 storage_options: Use variant on S3 options
Describing S3 storage for an sstables nowadays has two options -- via
sstables registry entry and by using the direct prefix string. The
former is used when putting a keyspace on S3. In this case each sstable
has the corresponding entry in the system.sstables table. The latter is
used by "restore from object storage" code. In that case, sstables don't
have entries in the registry, but are accessed by a specific S3 object
path.

This patch reflects this difference by making s3_options::location be
variant of string prefix and table_id owner. The owner needs more
explanation, here it is.

Today, the system.sstables schema defines partition key to be "string
location" and clustering key to be "UUID generation". The partition key
is table's datadir string, but it's wrong to use it this way. Next
patches will change the partition key to be table's ID (there's table_id
type for it), and before doing it storage options must be prepared to
carry it onboard. This patch does it, but the table_id alternative of
the location is still unused, the rest of the code keeps using the
string location to reference a row in the registry table. Next patches
will eventually make use of the table_id value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 13:04:52 +03:00
Kamil Braun
4d99cd2055 Merge 'raft: fast tombstone GC for group0-managed tables' from Emil Maskovsky
Add the gossip state for broadcasting the nodes state_id.

Implemented the Group0 state broadcaster (based on the gossip) that will broadcast the state id of each node and check the minimal state id for the tombstone GC.

When there is a change in the tombstone GC minimal state id, the state broadcaster will update the tombstone GC time for the group0-managed tables.

The main component of the change is the newly added `group0_state_id_handler` that keeps track, broadcasts and receives the last group0 state_ids across all nodes and sets the tombstone GC deletion time accordingly:
* on each group0 change applied, the state_id handler broadcasts the state_id as a gossip state (only if the value has changed)
* the handler checks for the node state ids every refresh period (configurable, 1h by default)
* on every check, the handler figures out the lowest state_id (timeuuid), which is state_id that all of the nodes already have
* the timestamp of this minimum state_id is then used to set the tombstone GC deletion time
* the tombstone GC calculation then uses that deletion time to provide the GC time back to the callers, e.g. when doing the compaction
* (as the time for tombstone GC calculation has the 1s granularity we actually deduce 1s from the determined timestamp, because it can happen that there were some newer mutations received in the same second that were not distributed across the nodes yet)

This change introduces a new flag to the static schema descriptor (`is_group0_table`) that is being checked for this newly added mode in the tombstone GC. We also add a check (in non-release builds only) on every group0 modification that the table has this flag set.

The group0 tombstone GC handling is similar to the "repair" tombstone GC mode in a sense (that the tombstone GC time is determined according to a reconciliation action), however it is not explicitly visible to (nor editable by) the user. And also the tombstone GC calculation is much simpler than the "repair" mode calculation - for example, we always use the whole range (as opposed to the "repair" mode that can have specific repair times set for specific ranges).

We use the group0 configuration to determine the set of nodes (both current and previous in case of joint configuration) - we need to make sure that we account for all the group0 nodes (if any node didn't provide the state_id yet, the current check round will be skipped, i.e. no GC will be done until all known nodes provide their state_id timestamp value).

Also note that the group0 state_id handling works on all nodes independently, i.e. each node might have its own (possibly different) state depending on the gossip application state propagation. This is however not a problem, as some nodes might be behind, but they will catch up eventually, and this solution has the benefit of being distributed (as opposed to having a central point to handle the state, like for example the topology coordinator that has been considered in the early stages of the design).

Fixes: scylladb/scylla#15607

New feature, should not be backported.

Closes scylladb/scylladb#20394

* github.com:scylladb/scylladb:
  raft: add the check for the group0 tables
  raft: fast tombstone GC for group0-managed tables
  tombstone_gc: refactor the repair map
  raft: flag the group0-managed tables
  gossip: broadcast the group0 state id
  raft/test: add test for the group0 tombstone GC
  treewide: code cleanup and refactoring
2024-10-11 11:52:27 +02:00
Pavel Emelyanov
ba97072709 sstables: Split sstable::filename() helper
To have the filename(type, prefix) one, next patches will provide prefix
on their own, to avoid storage->prefix() call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 12:47:13 +03:00
Pavel Emelyanov
6f9cb51259 sstables: Add s3_storage::owner() helper
This driver uses sstring _location as part of the lookup key in the
sstables registry. Next patches will need to change that and put more
checks on the registry access, so introduce a helper method beforehand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-11 12:47:12 +03:00
Sergey Zolotukhin
c373edab2d Add conditions checking for get_read_executor
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
  get_endpoints_for_reading(): the map is considered incorrect the number of
  read replica nodes is higher than replication factor. The check is
  applied only when built in non release mode.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.

Refs scylladb/scylladb#20625
2024-10-11 09:38:25 +02:00
Sergey Zolotukhin
8db6d6bd57 Avoid an extra call to block_for in db::filter_for_query. 2024-10-11 09:38:25 +02:00
Sergey Zolotukhin
ad93cf5753 Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.
2024-10-11 09:38:25 +02:00
Sergey Zolotukhin
ae23d42889 tools: Add build_info header with functions providing build type information
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.
2024-10-11 09:38:24 +02:00
Sergey Zolotukhin
132358dc92 tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625
2024-10-11 09:38:24 +02:00
Pavel Emelyanov
77eb9ddb0f sstable_set: Reserve vector of readers
When generating readers for the set of sstables, the end size of this
vector is known in advance and its storage can be reserved.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21055
2024-10-11 09:56:17 +03:00
Pavel Emelyanov
551da72492 api: Use captured database, not the one from ctx
Continuation of the previous patch -- not commitlog-related endpoints
can use provided database reference, that was captured from main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 17:53:30 +03:00
Pavel Emelyanov
74f7071db8 api: Pass sharded<database> to commitlog endpoints registration
This is to make registered enpoints with with the database without
grabbing one from ctx.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 17:52:56 +03:00
Pavel Emelyanov
14ab6d2615 api: Move commitlog-related from storage_service.cc
It registers itself in /storage_service function, but works with
commitlog, so should be located next to commitlog endpoints.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 17:52:25 +03:00
Pavel Emelyanov
ba73704774 api: Unset commitlog API endpoints
Most of other set_...()-s has the unset_...() scheduled right
afterwards, so here's one for set_server_commitlog().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 17:50:37 +03:00
Pavel Emelyanov
44ec6d36f3 api: Extract set_server_commitlog() from set_server_done()
The latter collects a bunch of endpoints including commitlog ones.
Extract it as snandalone call in main. It's currently not located next
to "commitlog server" as it should, because there's no standalone
commitlog service in main. It will be addressed as a followup together
with other endpoints that work with sharded<database>.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 17:49:10 +03:00
Pavel Emelyanov
1863ccd900 tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 12:39:58 +03:00
Pavel Emelyanov
e5bf376cbc group0_client: Introduce change validation
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 12:31:52 +03:00
Pavel Emelyanov
f09fe4f351 group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 12:27:46 +03:00
Botond Dénes
86fd9ce8fd schema/schema: break circular dependency with replica::database
The schema module (everything in schema/) is supposed to be towards the
leafs in the ScyllaDB inter-module dependency graph. In other words, it
should not depend on many other modules. On the other hand, almost the
entire codebase depends on the schema module itself.
Currently there is a circular dependency between schema and
replica::database, as the latter is a required argument for
schema::describe(). This is bad, not just because of the dependency mess
it introduces, but also because now schema::describe() can only be used
by code which has a reference to the database handy.

This patch breaks this circular dependency, by introducing the
schema_describe_helper interface and providing an implementation for it
in database.hh.

There is another circular dependency: schema <-> replica::table. This is
not addressed by this patch.

Closes scylladb/scylladb#20893
2024-10-10 10:07:26 +03:00
Botond Dénes
81423e8e76 Merge 'repair: Fix stall in repair_get_row_diff_with_rpc_stream_process_op_slow_path' from Asias He
Use clear_gently to avoid the following stalls.

```
~frozen_mutation_fragment at ././frozen_mutation.hh:268

std::destroy_at<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88

std::allocator_traits<std::allocator<std::_List_node<frozen_mutation_fragment>
> >::destroy<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537

std::__cxx11::_List_base<frozen_mutation_fragment,
std::allocator<frozen_mutation_fragment> >::_M_clear at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77

~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499
~partition_key_and_mutation_fragments at ././repair/repair.hh:298
~repair_row_on_wire_with_cmd at ././repair/repair.hh:335
operator() at ./repair/row_level.cc:1881
```

Fixes #21016

Performance improvement only. No backport.

Closes scylladb/scylladb#21017

* github.com:scylladb/scylladb:
  repair: Fix stall in repair_get_row_diff_with_rpc_stream_process_op_slow_path
  repair: Add clear_gently for partition_key_and_mutation_fragments
2024-10-10 09:27:27 +03:00
Benny Halevy
3a12ad96c7 sstables: scylla_metadata: add sstable identifier
Keep a copy of the sstable uuid generation in a new
scylla_metadata sstable_identifier attribute.

If the SSTable happens to have a numerical generation
just create a new time-uuid and log a message about that.

Dump this new attribute in scylla sstable dump tool.

And add a unit test to verify that the written (and then
loaded) sstable identifier matches the sstable's generation.

The motivatrion for this change stems from backup
deduplication.  In essence, an sstable may already have been
backed up in a previous snapshot, and we don't want to
abck it up again if it's already present on external storage.

Today this is based on rclone that compares files checksums,
but once scylla will backup the sstables using the native
object-storage stack (#19890), we would like to use the sstable
globally-unique identifier for deduplication.  Although the
uuid-generation is encoded in the sstable path, the latter
may change, e.g. due to intra-node migration, so keep a copy
of the original unique identifier in scylla-metadata, and that
attribute would survive file-based or intra-node migrations.

Fixes scylladb/scylladb#20459

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21002
2024-10-10 08:52:46 +03:00
Avi Kivity
b66479ea98 Merge 'compaction: fix potential data resurrection with file-based migration' from Ferenc Szili
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:

1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.

This change fixes this problem by disabling tombstone garbage collection for pending replicas.

This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.

This change has to be backported to all active versions: 6.0, 6.1 and 6.2, as well as Enterprise 2024.2

Closes scylladb/scylladb#20788

* github.com:scylladb/scylladb:
  test: test tombstone GC disabled on pending replica
  tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
  database::table: add tombstone_gc_enabled(locator::tablet_id)
2024-10-09 21:49:49 +03:00
Avi Kivity
bb1867c7c7 Merge 'sstables: Add digest checking in the validation path of the sstable layer' from Nikos Dragazis
This PR builds upon the PR for checksum validation (#20207) to further enhance scrub's corruption detection capabilities by validating digests as well. The digest (full checksum) is the checksum over the entire data, as opposed to per-chunk checksums which apply to individual chunks. Until now, digests were not examined on any code paths. This PR integrates digest checking into the compressed/checksummed data sources as an optional feature and enables it only through the validation path of the sstable layer (`sstable::validate()`). The validation path is used by the following tools:

* scrub in validate mode
* `sstable validate`

All other reads, including normal user reads, are unaffected by this change.

The PR consists of:
* Extensions to the compressed and checksummed data sources to support digest checking. The data sources receive the expected digest as a parameter and calculate the actual digest incrementally across multiple get() calls. The check happens on the get() call that reaches EOF and results to an exception if the digest is invalid. A digest check requires reading the whole file range. Therefore, a partial read or skip() is treated as an internal error.
* A new shareable digest component loaded on demand by the validation code. No lifecycle management.
* Grouping of old scrub/validate tests for compressed and uncompressed SSTables to reduce code duplication.
* scrub/validate tests for SSTables with valid checksums but invalid digests, and SSTables with no digests at all.
* scrub/validate tests with 3.x Cassandra SSTables to ensure compatibility.

Refs #19058.

New feature, no backport is needed.

Closes scylladb/scylladb#20720

* github.com:scylladb/scylladb:
  test: Test scrub/validate with SSTables from Cassandra
  compaction: Make quarantine optional for perform_sstable_scrub()
  test: Make random schema optional in scrub_test_framework
  test: Add tests for invalid digests
  test: Merge scrub/validate tests for compressed and uncompressed cases
  sstables: Verify digests on validation path
  sstables: Check if digest component exists
  sstables: Add digest in the SSTable components
  sstables: Add digest check in compressed data source
  sstables: Add digest check in checksummed data source
2024-10-09 21:33:08 +03:00
Benny Halevy
d34878e96c view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20980
2024-10-09 20:56:21 +03:00
Nadav Har'El
a1999cd5d5 cql-pytest: fix run-cassandra on systems with default Java 8
The test/cql-ptest/run-cassandra prefers to use Java 11 if installed on
the system because this is the only version of Java that all modern
versions of Cassandra run on (Cassandra 3 and 4 can run on Java 8 and 11,
Cassandra 5 can run on Java 11 and 17).

However, in our search order we tried the "java" in the user's path
first, before trying Java 11. This means that if the user for some
reason had the ancient Java 8 (which is now a decade old) as his
default "java" got that, instead of Java 11, and couldn't run Cassandra 5.

While at it, update the comments to reflect the new reality that
Cassandra 5 needs Java 17 or 11 - *not* 11 or 8 as the older Cassandra.
We should eventually change the code logic as well (searching for
versions that depend on the Cassandra version - not always Java 8 and
11), but let's do it later. This patch already fixes a real bug for
developers that did install Java 11 but their default "java" pointed to
Java 8.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21001
2024-10-09 20:51:56 +03:00
David Garcia
2247bdbc8c docs: Fix confgroup links
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.

Closes scylladb/scylladb#21018
2024-10-09 20:16:15 +03:00
Pavel Emelyanov
3dcf3d65d7 replica: Use substract_sets() helper
The process_one_row() evaluates pending_replica by subtracting replicas
from new_replicas. There's a convenience helper for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21019
2024-10-09 20:02:16 +03:00
Gleb Natapov
f7e7e61fa7 raft: add more information to start_read_barrier error
Add requester into to the error about requester being out of config.
Also fix a typo while we are at it.

Message-ID: <ZwVDtOty2cWy3vqD@scylladb.com>
2024-10-09 16:24:34 +02:00
Pavel Emelyanov
7163fbcef5 Merge 'utils: replace dependency on boost ranges with <ranges>' from Avi Kivity
To avoid depending on two similar libraries (boost ranges and std \<ranges), replace
uses of the former with the latter. This series tackles the utils/ directory.

Code cleanup, no backport.

Closes scylladb/scylladb#20997

* github.com:scylladb/scylladb:
  utils: logalloc: replace boost with std
  utils: lsa: chunked_managed_vector: replace boost with std
  utils: config_file: replace boost with std
  utils: loading_cache: replace boost with std
  utils: fragment_range: replace boost with std
  utils: error_injector: replace boost with std
  utils: crc: replace boost for_each with built-in range for
  utils: class_registrator: replace boost with std
  utils: chunked_vector: replace boost with std
  utils: observable: replace boost with std
2024-10-09 16:04:48 +03:00
Botond Dénes
3e468608e7 Merge 'Collect sstables on boot from all datadirs (and don't collect from S3 twice)' from Pavel Emelyanov
There's a long-pending issue in distributed loader. When it populates sstables on boot it loops over table.config.all_datadirs, but ignores the loop cursor (the datadir itslef), instead loading sstables from table.config.dir, which is 0th element of all_datadirs. There's a test for that, but it's also broken. Effectively collection happens from table.config.dir several times. For local sstables that's just wasted work and potentially lost sstables (but nobody seems to configure more than 1 datadir anyway). For S3 sstables it's also wasted work and incorrectness.

The fix is for both -- populator and test. The former is to use all_datadirs to construct sstable_directory. To make it happen, creation of sstable_directory now depends on the storage options, the loop is moved into the branch that creates sstable_directory for local storage type. The test fix is to make sure that some sstables in non-default datadir before running population code.

Closes scylladb/scylladb#20819

* github.com:scylladb/scylladb:
  test: Fix test_multiple_data_dirs
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Use correct datadir to collect local sstable
  distributed_loader: Move all-datadirs loop to local storage collecting
  distributed_loader: Collect table subdirs based on its storage options
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Squash loop of collect_subdir into one method
  distributed_loader: Convert map of directories into a vector
  distributed_loader: Make start_subdir() method work with directory
  distributed_loader: Drop local reference variable
  distributed_loader: Split start_subdir()
  distributed_loader: Remove allow-offstrategy argument
  distributed_loader: Make populate() method work with directory
  distributed_loader: Remove check for sstable_directory presense
  distributed_loader: Out-line table_populator() methods
  distributed_loader: Print storage options, not datadir
  distributed_loader: Print prepared message
  sstable_directory: Add sstable_state argument ot one of constructors
  sstable_directory: Add state() method
2024-10-09 14:43:34 +03:00
Michał Chojnowski
c2ba300f1c reader_concurrency_semaphore: in stats, fix swapped count_resources and memory_resources
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.

This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.

Closes scylladb/scylladb#20714
2024-10-09 14:12:01 +03:00
Lakshmi Narayanan Sreethar
69c385f540 compaction: make drain wait for compactions to stop during shutdown
During shutdown, the compaction_manager starts stopping ongoing
compaction tasks through `really_do_stop()` method as soon as it
receives a signal from the abort source. Later, when the database object
shuts down, it calls `compaction_manager::drain` to ensure that all
compaction tasks have stopped. However, `compaction_manager::drain` is
currently implemented in such a way that, during shutdown, it
effectively becomes a no-op because the compaction_manager has already
initiated the stopping of tasks. As a result the caller assumes that all
the compaction tasks have stopped and proceeds to close all the tables.
This can lead to race conditions where table closures overlap with
compaction tasks that are still running, resulting in exceptions like :

```
exception during mutation write to 127.0.0.1:
utils::internal::nested_exception<std::runtime_error> (Could not write
mutation system:compaction_history
(pk{0010b70d31705e0411efb2edf6467f094c8b}) to commitlog):
seastar::gate_closed_exception (gate closed)
```

This commit fixes the issue by updating `compaction_manager::drain` to
invoke `stop_ongoing_compactions` even during shutdown to ensure that it
waits for the ongoing compaction tasks to complete. The
`stop_ongoing_compactions` method will also send a stop request to these
tasks before waiting, but the request will be ignored by the tasks as
they would have already received one earlier from `really_do_stop()`.

Fixes #20197

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20715
2024-10-09 12:08:32 +03:00
Pavel Emelyanov
17ec416178 Merge 'Make sure S3 upload completion parses possible error' from Ernest Zaslavsky
fixes #20517
Adds `aws_error` which possibly can contain errors from the S3 response body. Adds to the multipart upload completion a check for possible error and issues a retry if the error is retryable

Closes scylladb/scylladb#20518

* github.com:scylladb/scylladb:
  test: add complete_multipart_upload completion tests
  code: s3 client error handling
  code: add response parsing and error handling to the complete_multipart_upload
  code: Introduce AWS errors parsing
2024-10-09 12:01:27 +03:00
Piotr Smaron
e0c1a51642 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240

Closes scylladb/scylladb#21007
2024-10-09 10:51:18 +02:00
Pavel Emelyanov
0bc8d0c620 Merge 'utils: unconst: wean away from boost range library' from Avi Kivity
As part of the effort to standardize on a single range library, convert the unconst helper
and its only user to \<ranges>.

The only user, mutation_partitions, happens to use intrusive_btree::iterator as the payload. That
iterator wasn't fully conform to iterator requirements, so it's fixed in a preliminary patch.

Code cleanup; no backport.

Closes scylladb/scylladb#20986

* github.com:scylladb/scylladb:
  utils/unconst, mutation_partition: switch to ranges
  utils: intrusive_btree: improve conformity with iterator requirements
2024-10-09 10:06:52 +03:00
Yuao Ma
1cc7821d12 tools: fix typos in the code
This patch corrects a minor typo without any functional changes.

Signed-off-by: Yuao Ma <c8ef@outlook.com>

Closes scylladb/scylladb#20975
2024-10-09 08:18:36 +03:00
Asias He
2d8442f663 repair: Fix stall in repair_get_row_diff_with_rpc_stream_process_op_slow_path
Use clear_gently to avoid the following stalls.

```
~frozen_mutation_fragment at ././frozen_mutation.hh:268

std::destroy_at<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88

std::allocator_traits<std::allocator<std::_List_node<frozen_mutation_fragment>
> >::destroy<frozen_mutation_fragment> at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537

std::__cxx11::_List_base<frozen_mutation_fragment,
std::allocator<frozen_mutation_fragment> >::_M_clear at
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77

~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499
~partition_key_and_mutation_fragments at ././repair/repair.hh:298
~repair_row_on_wire_with_cmd at ././repair/repair.hh:335
operator() at ./repair/row_level.cc:1881
```

Fixes #21016
2024-10-09 09:37:49 +08:00
Asias He
f5f26e1bba repair: Add clear_gently for partition_key_and_mutation_fragments
It is used to clear mutation_fragments to avoid stalls.
2024-10-09 09:31:14 +08:00
Emil Maskovsky
0c9308cf48 raft: add the check for the group0 tables
Added the runtime check to ensure that all the tables that are used with
the group0 commands are marked as group0 tables.
2024-10-08 21:08:11 +02:00
Emil Maskovsky
a03e98d6e8 raft: fast tombstone GC for group0-managed tables
Set the tombstone GC time for group0-managed tables to the minimal state
id of the group0 nodes.

The check is being done based on a timer, iterating through each node
(according to the group0 topology configuration) and taking the minimum
across all nodes.

This miminum timestamp is then be used to set the tombstone GC time
for the tombstone GC of all the group0-managed tables.

Fixes: scylladb/scylla#15607
2024-10-08 21:07:30 +02:00
Emil Maskovsky
74bd79bbb3 tombstone_gc: refactor the repair map
Move the repair_map definition to the tombstone_gc file where it is
mostly being used.

Refactor and add the accessors and setters for the group0 tombstone GC
time.
2024-10-08 20:53:54 +02:00
Emil Maskovsky
22471410e7 raft: flag the group0-managed tables
Add the schema flag to indicate the group0-managed tables.

This is to be used to identify and list the group0-managed tables.
2024-10-08 20:53:54 +02:00
Emil Maskovsky
baea9cfa67 gossip: broadcast the group0 state id
Implemented the group0 state_id handler (based on the gossip) that will
broadcast the group0 state id of each node.

This will be used to set the tombstone GC time for the group0 tables.
2024-10-08 20:53:54 +02:00
Emil Maskovsky
fa45fdf5f7 raft/test: add test for the group0 tombstone GC
Test that the group0 fast tombstone GC works correctly.
2024-10-08 20:53:54 +02:00
Emil Maskovsky
a840949ea0 treewide: code cleanup and refactoring
Fix the clang-tidy warnings, code cleanup and improvements.

Applied the clang format to the updated places.
2024-10-08 20:53:54 +02:00
Nadav Har'El
b4df07df71 Merge 'cql3: Print arguments and return type without frozen when describing UDF' from Dawid Mędrek
Scylla doesn't allow for the types of arguments or the return type of a UDF
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.

Fixes scylladb/scylladb#20256

Backport: necessary as the restore process may not work correctly without these changes. The affected versions span from 5.2 to the current master, but we only want to apply the fix to the live versions, so 6.0, 6.1, and 6.2.

Closes scylladb/scylladb#20816

* github.com:scylladb/scylladb:
  cql3/functions/user_function: Print arguments and return type without frozen
  cql3/functions/user_function: Use fmt to format create statement
2024-10-08 16:05:28 +03:00
Kamil Braun
2d9b8f269f Merge 'cql: improve validating RF's change in ALTER tablets KS' from Piotr Smaron
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: #20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

Closes scylladb/scylladb#20208

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-08 14:33:45 +02:00
Kamil Braun
1b9337bf99 Merge 'Wait for all users of group0 server to complete before destroying it' from Gleb Natapov
Group0 server is often used in asynchronous context, but we do not wait
for them to complete before destroying the server. We already have
shutdown gate for it, so lets use it in those asynch functions.

Also make sure to signal group0 abort source if initialization fails.

Fixes scylladb/scylladb#20701

Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.

Closes scylladb/scylladb#20891

* github.com:scylladb/scylladb:
  group: hold group0 shutdown gate during async operations
  group0: Stop group0 if node initialization fails
2024-10-08 13:46:54 +02:00
Avi Kivity
48ea51029f Merge 'time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value' from Benny Halevy
Currently, `estimated_pending_compactions` uses a precalculated value calculated by `update_estimated_compaction_by_tasks`, which, in turn, is called by `get_compaction_candidates`.  That means that, if `estimated_pending_compactions` is called, e.g. right after major compaction, it will return an outdated value that was calculated prior to major compaction, and so, it is no longer relevant.

Instead, just recalculate the value in `estimated_pending_compactions` and drop `update_estimated_compaction_by_tasks`.

* Enhancement, no backport required

Closes scylladb/scylladb#20892

* github.com:scylladb/scylladb:
  test: cql-pytest: test_compaction: add test_compactionstats_after_major_compaction
  test/cql-pytest: rename test_compaction{_tombstone_gc,}
  time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value
2024-10-08 13:29:51 +03:00
Gleb Natapov
d62fbd795b storage_proxy: make sure there is no end iterator in _live_iterators array
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.

The patch makes sure that there is no end iterator in _live_iterators.

Fixes scylladb/scylladb#20874

Closes scylladb/scylladb#20977
2024-10-08 13:16:27 +03:00
Avi Kivity
656dc438ab utils: logalloc: replace boost with std 2024-10-08 12:07:14 +03:00
Avi Kivity
84b25a51f5 utils: lsa: chunked_managed_vector: replace boost with std 2024-10-08 12:03:30 +03:00
Avi Kivity
fa772701be utils: config_file: replace boost with std 2024-10-08 12:03:15 +03:00
Avi Kivity
b62fadae5f utils: loading_cache: replace boost with std
Unfortunately, the replacement for boost::range::join(),
std::views::concat(), is in C++26 (and not implemented in libstdc++ 14).
We use array/transform/join to simulate it.
2024-10-08 11:54:34 +03:00
Laszlo Ersek
934b42c6a8 cmake/check_headers: correct typos
Commit efd65aebb2 ("build: cmake: add check-header target", 2023-11-13)
introduced three typos:

- In "cmake/check_headers.cmake", it checked whether the
  "parsed_args_GLOB_RECURSE" argument was defined, but then it referenced
  the same under the wrong name "parsed_args_RECURSIVE".

- The above error masked two further typos; namely the duplicate use of
  "api" and "streaming" each, as targets. With "parsed_args_GLOB_RECURSE"
  above fixed, CMake now reports these conflicting arguments (target
  names). They should have been "node_ops" and "sstables", respectively.

Correct the typos.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20992
2024-10-08 09:38:16 +03:00
Dawid Mędrek
8582ed513b cql3/functions/user_function: Print arguments and return type without frozen
Scylla doesn't allow for the types of arguments or the return type
to be frozen. As a result, before these changes, create statements
produced to restore UDFs as part of `DESCRIBE` statements could not
be executed.

We fix that and add a reproducer test and another one to verify that
the implementation is correct.
2024-10-07 20:53:10 +02:00
Avi Kivity
72a39b84b0 utils: fragment_range: replace boost with std 2024-10-07 21:32:16 +03:00
Avi Kivity
c8a68c4cf7 utils: error_injector: replace boost with std 2024-10-07 21:28:36 +03:00
Avi Kivity
c560686d92 utils: crc: replace boost for_each with built-in range for
Simpler.
2024-10-07 21:19:14 +03:00
Avi Kivity
44419fc5ec utils: class_registrator: replace boost with std 2024-10-07 21:16:03 +03:00
Avi Kivity
adb92a6c16 utils: chunked_vector: replace boost with std 2024-10-07 21:11:23 +03:00
Avi Kivity
b259389a3e utils: observable: replace boost with std 2024-10-07 21:11:07 +03:00
Nadav Har'El
45ccceb137 alternator: add "dc" and "rack" options to "/localnodes" request
Before this patch, the "/localnodes" HTTP request to the Alternator server
lists all the live nodes of the current DC. This patch adds two optional
parameters to this query:

  dc: allows to list the live nodes of a specific named DC instead of the
      current DC of the server.

  rack: allows to restrict the results to just the nodes belonging to a
      specific named rack.

For both options, if no live node exists in the given dc or rack (in
particular, if such a dc or rack doesn't even exist), an empty list is
returned - it's not an error.

The default, if dc or rack is not specified - remains exactly as it is
today - look at the current DC (the one of the node being request), and
do not restrict the list to any specific rack.

We expect the new options that we added here to be useful for two use cases:

1. A client that knows of *some* Scylla node (belonging to an unknown DC),
   but wants to list the nodes in *its* DC, which it knows by name.

2. A client in a multi-rack DC (e.g., multi-AZ region in AWS) that wants
   to send requests to nodes in its own rack (which it knows by name),
   to avoid cross-rack networking costs.

Note that in both cases, this requires clients to know the names of DCs
and AZs via some out-of-band means. The client can also get a list of DCs
and racks using the system.local system table, as the tests included in
this patch demonstrate.

This patch includes two set of tests for these new options: One in the the
single-node test/alternator framework that has a single dc and rack but
can still check the case of an unknown dc or rack (in which case an empty
list is returned). The second test is in the topology framework, and runs
an 8-node cluster with two DCs, two racks, and two nodes in each, and
checks all the combinations of "/localnodes" requests with and without
dc and rack options. This test also resolves a longstanding TODO that
asked for such a multi-DC test for "/localnodes" to be written.

Fixes #12147

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20915
2024-10-07 20:53:47 +03:00
Pavel Emelyanov
8bfbc563cc test: Remove sstable factory from test_min_max_clustering_key()
The helper makes sstables from env directly. Callers may not create the
factor after that. Less code the better.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20983
2024-10-07 20:08:05 +03:00
Kefu Chai
a6ec6d32ab auth: add "IWYU pragma: keep" to keep boost/regex_fwd.hpp
clang-include-cleaner is not able to tell that the header provides
the template parameter of `std::vector<std::pair<query_source, boost::regex>>`.
and suggest us to remove this include. but it's wrong.

so, in this change we apply the "pragma" to keep it.
see
https://github.com/include-what-you-use/include-what-you-use/blob/master/docs/IWYUPragmas.md
for the explanations on what this pragma is for.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-07 20:08:05 +03:00
Kefu Chai
3d31835949 auth: include boost/regex_fwd.hpp in header
since we only need the full definition of boost::regex in the .cc
file, where we

- define the constructor and destructor
- and actually use the regex.

there is no need to include boost/regex.hpp in the header, in order
to keep the preprocessed header smaller. let's use a header only
contains forward declarations in header, and include the full
definition in the .cc file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-07 20:08:05 +03:00
Piotr Smaron
ee56bbfe61 cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.
2024-10-07 17:02:50 +02:00
Piotr Smaron
2aabe7f09c cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
2024-10-07 17:02:45 +02:00
Avi Kivity
d12ba753e0 utils/unconst, mutation_partition: switch to ranges
unconst is a small help that converts a const iterator to a non-const
iterator with the help of the container. Currently it is using the
boost iterator/range libraries.

Convert it to <ranges> as part of an effort to standardize on a single
range library. Its only user in mutation_partition is converted as well.

Due to more iteroperability problems between <range> and boost, some
calls to boost::adaptors::reversed have to be converted as well.
2024-10-07 17:30:12 +03:00
Avi Kivity
75f4ea1b68 utils: intrusive_btree: improve conformity with iterator requirements
The <ranges> library checks that an iterator's operator++() returns
a reference to the same type. intrusive_btree's iterator do not; instead
they return some base type and rely on implicit conversion to the real
iterator type. This causes interoperatibility problems with <range>.

Fix by using the CRTP pattern to inform iterator_base about what
type we really are, and cast to it. Enforce it with static_assert.

Note we can't static_assert in class scope since it is checked too
early and fails. Checking in function scope delays the check.
2024-10-07 17:26:01 +03:00
Piotr Smaron
6676e47371 cql: fix validation of ALTERing RFs in tablets KS
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
   needs their RFs to be validated.
2024-10-07 16:02:01 +02:00
Piotr Smaron
93d61d7031 cql: harden alter_keyspace_statement.cc::validate_rf_difference
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.
2024-10-07 16:02:01 +02:00
Piotr Smaron
47acdc1f98 cql: validate RF change for new DCs in ALTER tablets KS
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.

Refs: #20039
2024-10-07 16:02:01 +02:00
Piotr Smaron
9c5950533f cql: extend test_alter_tablet_keyspace_rf
Added cases to also test decreasing RF and setting the same RF.
Also added extra explanatory comments.
2024-10-07 16:02:00 +02:00
Piotr Smaron
adf453af3f cql: refactor test_tablets::test_alter_tablet_keyspace
1. Renamed the testcase to emphasize that it only focuses on testing
   changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8
2024-10-07 16:02:00 +02:00
Piotr Smaron
042825247f cql: remove unused helper function from test_tablets
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.
2024-10-07 16:02:00 +02:00
Nikos Dragazis
7a1ec3aa41 test: Test scrub/validate with SSTables from Cassandra
All current unit tests for scrub in validate mode generate random
SSTables on the fly.

Add some more tests with frozen Cassandra SSTables from the source tree
to verify compatibility with Cassandra. Use some of the existing 3.x
Cassandra SSTables to test the valid case, and use the same schema to
generate some corrupted SSTables for the invalid case. Overall, the new
tests cover the following scenarios:

* valid compressed/uncompressed
* compressed/uncompressed with invalid checksums
* compressed/uncompressed with invalid digest

For the compressed SSTable with invalid checksums, a small chunk length
was used (4KiB) to have more chunks with less disk space. For
uncompressed SSTables the chunk length is not configurable.

Finally, since the SSTables live in the source tree, the quarantine
mechanism was disabled.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:38 +03:00
Nikos Dragazis
7090e2597f compaction: Make quarantine optional for perform_sstable_scrub()
Allow `perform_sstable_scrub()` to disable quarantine for invalid
SSTables detected by scrub in validate mode. This is already supported
by the lower-level function `scrub_sstables_validate_mode()` via the
flag `quarantine_sstables` and is being used by sstable-scrub.

Propagate the flag up to `perform_sstable_scrub()`. This will allow to
test scrub/validate against read-only SSTables from the source tree.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:38 +03:00
Nikos Dragazis
5f2be2924e test: Make random schema optional in scrub_test_framework
The scrub_test_framework, which is the foundation for all scrub-related
tests, always generates a random schema upon initialization and makes it
available to the user. This is useful for running tests with ephemeral
SSTables, but is redundant when the creation of the SSTable predates the
test (e.g., it lives in the source tree).

Turn scrub_test_framework into a template with a boolean parameter to
optionally switch off the random schema generation. Also, add an
overload for run() to support passing a ready-to-use SSTable instead of
mutation fragments.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:38 +03:00
Nikos Dragazis
07ed0a48aa test: Add tests for invalid digests
In a previous patch we extended the validation path of the SSTable layer
to validate the digests along with the checksums.

Add two tests for compressed and uncompressed SSTables to test the
validation API against SSTables with valid checksums but corrupted
digests.

Add two more tests to ensure that the absence of digest does not affect
checksum validation.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:38 +03:00
Nikos Dragazis
39a74fb692 test: Merge scrub/validate tests for compressed and uncompressed cases
Currently, every scrub/validate test is duplicated to cover both
compressed and uncompressed SSTables. However, except for the
compression type, the tests are identical. This leads to some code
bloat.

Introduce common functions parameterized by the compression type to
reduce code duplication. Also, group together the compressed and
uncompressed variants into one compression-agnostic test.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:38 +03:00
Nikos Dragazis
3a3783ee23 sstables: Verify digests on validation path
Extend the validation path to perform digest checking on all SSTables.
This is achieved by loading the digest component on demand and passing
it to the underlying data sources only during validation. The data
sources for compressed and uncompressed SSTables were modified in
previous patches to support digest checking.

Consider digest checking as part of the integrity checking mechanism
(i.e., requires `integrity_check::yes`) to ensure it remains disabled
for all reads happening outside of the validation path (i.e.,
`sstable::validate()`). This practically means that digest checking is
enabled only for:

* scrub in validate mode
* sstable validate

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-07 15:21:09 +03:00
Avi Kivity
7dad248ac7 Merge 'Fix sstables registry mock' from Pavel Emelyanov
There are two issues in it. First, listing the registry with a consumer callback passes wrong argument to the consumer. Second, the primary key of the registry is wrong. Both issues don't show up, because existing tests that use mock don't read from it, only write. Tests that read from registry are python tests that start scylla and thus use real registry.

Closes scylladb/scylladb#20946

* github.com:scylladb/scylladb:
  test: Use corrcet key in sstables registry mock
  test: Pass entry status to mock registry consumer
2024-10-07 13:56:26 +03:00
Anna Stuchlik
a601845780 doc: remove outdated JMX references
This commit removes references to JMX from the docs.

Context:
The JMX server has been dropped and removed from installation. The user can
install it manually if needed, as documented with https://github.com/scylladb/scylladb/issues/18687.

This commit removes the outdated information about JMX from other pages
in the documentation, including the docs for nodetool, the list of ports,
and the admin section.

Also, the no longer relevant JMX information is removed from
the Docker Hub docs.

Fixes https://github.com/scylladb/scylladb/issues/18687
Fixes https://github.com/scylladb/scylladb/issues/19575

Closes scylladb/scylladb#20917
2024-10-07 13:55:15 +03:00
Nadav Har'El
987042be68 mv, test: reproduce missing validation for view name
This patch adds reproducer tests (still failing) for issue #20755, which
is about missing validation of materialized view names:

1. Unlike table and keyspace names which are limited to 48 characters,
   we forgot to limit view name length, and an excessively long name can
   cause Scylla to shut down :-(

2. Unlike table and keyspace names which only allow alphanumeric
   characters, view names are missing this check and can include any
   characters.

3. Luckily, even though we are missing the alphanumeric check, we at
   least don't allow "/" in view names (if we allowed them, it could
   allow users to write in any directory in the filesystem!). But when
   this happens, we get an internal error instead of the expected
   errors.

The first test also fails on Cassandra (it doesn't crash it, but leaves
the table in a strange state), but the other two pass.

Refs #20755

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20761
2024-10-07 13:49:58 +03:00
Avi Kivity
73eeb6d274 Merge 'clang-format: adjustments to avoid unwanted refactors and better match the Seastar coding style' from Emil Maskovsky
Some adjustments to the `.clang-format` options to better match the current code:

* don't sort the include headers: causes large diffs especially in files with a lot of includes, and the `#include` ordering is not prescribed by the Seastar coding style
* binpack the arguments in function declarations and calls: allow binpacking (as opposed to forcing each parameter on a separate line if they don't fit into the line length)
* indented parameter continuation (as opposed to aligning to the open parenthesis) - aligning to the open parenthesis causes alignment issues especially with lambdas

Fixes: scylladb/scylladb#20951

No backport: Not a product issue, just applies to master.

Closes scylladb/scylladb#20968

* github.com:scylladb/scylladb:
  clang-format: argument and function packing
  clang-format: don't sort the include headers
2024-10-07 13:21:31 +03:00
Pavel Emelyanov
1870873538 test: Fix test_multiple_data_dirs
The one was broken from the very beginning. It only checked that after
creating a table, its directory is created in all datadirs. But it
didn't check that after restart populating happens from the all. That's
because all directories by 0th were always empty, so not-populating from
them didn't skip any data.

Fix it by moving all sstables from datadirs[0] to datadirs[1] before
restart. With that update not-populating data from datadirs[1] will
be noticed instantly. Fortunately, previous patches fixed that, so the
test still passes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
aa0c20a0e7 distributed_loader: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
792c0060c7 distributed_loader: Use correct datadir to collect local sstable
Current code uses datadir it gets from table itself, which is the 0th
element in the all-datadirs config. So populating local sstables happens
several times from the same directory. Fix it by starting sstable
directory with correct datadir -- the one obtained from the all-datadirs
loop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
bf654f45bd distributed_loader: Move all-datadirs loop to local storage collecting
It now happens in the outer loop, but it's not correct for S3 storage,
which is thus asked to collect its data twice. Also it's broken for
local storage as well, because the datadir argument is ignored. Next
patch will fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
fe779ab1a2 distributed_loader: Collect table subdirs based on its storage options
Collecting sstables for local storage and for S3 storage differs. First,
the populator collects sstables for each datadir configured in
scylla.yaml, but S3 storage doesn't care, so it's effectively asked to
collect the same data twice. Second, S3 collector code uses
sstable_directory simply because that class is used by reshape and
reshard code, but in fact collecting of S3 sstable can be made much
simpler (but that's for later).

Having said that, split preparation of sstables population for local
and S3 storage types.

Indentation is deliberately left broken for local storage collecting
mathod. That's because otherwise next patch will need move it back
anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
ee91cae5b9 distributed_loader: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
6ae486100c distributed_loader: Squash loop of collect_subdir into one method
This prepares the gound for the next patch.
Indentation is left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
4db2929afd distributed_loader: Convert map of directories into a vector
Knowledge of sstable state is no longer needed in the table_populator
start/stop methods, so the map<state, directory> can be converted into
vector<directory>.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
89e9231653 distributed_loader: Make start_subdir() method work with directory
Similarly to populate_subdir() one, it also accepts state and gets
directory out of it. Patch is the same way -- caller now passes it the
reference to directory and doesn't care about the state (in fact, the
start_subdir() doesn't care of the state either).

While at it -- rename the method to reflect what it does.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
999ec88765 distributed_loader: Drop local reference variable
Cleanup after previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
50024ef62a distributed_loader: Split start_subdir()
It does two things -- starts sstable_directory and prepares it. Split it
accordingly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
abdc0bb02d distributed_loader: Remove allow-offstrategy argument
This is to make populate_subdir() be self-contained in a way it uses
passed sstable_directory and make caller not care about the state.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
3b583b9d9f distributed_loader: Make populate() method work with directory
The populate_subdir() accepts sstable_state argument and picks the
corresponding sstable_directory object from the map. Patch it so that
caller passes it the sstable_directory reference. For now it makes
things more complicated, but next patches will simplify it back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
051ac3a737 distributed_loader: Remove check for sstable_directory presense
In the old days the set of sstable_directory-s used by populator could
skip some of them. Now they are all present and the checks is always
false.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
4752a504cd distributed_loader: Out-line table_populator() methods
To make further patching with less indentation level.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
617b0e3ce3 distributed_loader: Print storage options, not datadir
Tables not necessarily have data in a directory, so it's more correct to
show storage options in logs, not some directory path.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
31b2271f07 distributed_loader: Print prepared message
When population throws, the catch block prepares a message to re-throw
another exception and prints the same message into logs. Presumably the
intent was to print the prepared message as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:04:23 +03:00
Pavel Emelyanov
87d392d071 sstable_directory: Add sstable_state argument ot one of constructors
There's one constructor that became unused after 787ea4b1. Modify it
with the 'state' argument so that it could be used later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 12:03:36 +03:00
Pavel Emelyanov
b56483ab67 sstable_directory: Add state() method
The one will expose sstables state the directory works with. For
convenience.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-07 11:23:50 +03:00
Kefu Chai
abda779a5b compaction: return created sst without using a temporary variable
simpler this way. `sst` does not help with the readability or
performance, but let's drop it. simpler this way. also, remove the
unused parameter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20961
2024-10-07 10:56:25 +03:00
Pavel Emelyanov
8ccb4a1045 Merge 'db: remove unused includes ' from Kefu Chai
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.

so, in this change, all the unused includes are removed.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#20963

* github.com:scylladb/scylladb:
  .github: add db to iwyu's CLEANER_DIR
  db: remove unused includes
2024-10-07 10:55:48 +03:00
Kefu Chai
cd05f61607 api/storage_service: use ranges when handlging restore API
this change is a follow up of 787ea4b1, to modernize the code base.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20972
2024-10-07 10:54:37 +03:00
Avi Kivity
946bb870f3 utils: hashers: include <memory>
hashers.hh uses std::unique_ptr, so include its header.

Closes scylladb/scylladb#20974
2024-10-07 10:52:36 +03:00
Kefu Chai
c6bc5b2706 sstable_loader: Remove unused _snapshot_name from download_task_impl
in 787ea4b1, we introduced `_prefix` and `_sstables` member  variables
to `sstables_loader::download_task_impl`, replacing the functionality
of `_snapshot_name`. However, we overlooked removing the now-obsolete
`_snapshot_name` variable.

this commit removes the unused `_snapshot_name` member variable to
improve code cleanliness and prevent potential confusion.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20969
2024-10-07 10:43:13 +03:00
Benny Halevy
fa8fe62e90 test: cql-pytest: test_compaction: add test_compactionstats_after_major_compaction
Test that compactionstats are empty, i.e.
there are no required compactions following
major compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-07 10:24:06 +03:00
Benny Halevy
630b792bd0 test/cql-pytest: rename test_compaction{_tombstone_gc,}
Prepare to add more tests related to compaction to
this test suite.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-07 10:18:30 +03:00
Benny Halevy
284dbc51c3 time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value
Currently, `estimated_pending_compactions` uses a precalculated value
calculated by `update_estimated_compaction_by_tasks`, which, in turn,
is called by `get_compaction_candidates`.  That means that, if
`estimated_pending_compactions` is called, e.g. right after
major compaction, it will return an outdated value that was
calculated prior to major compaction, and so, it is no longer
relevant.

Instead, just recalculate the value in `estimated_pending_compactions`
and drop `update_estimated_compaction_by_tasks`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-07 10:15:19 +03:00
Gleb Natapov
e642f0a86d group: hold group0 shutdown gate during async operations
Wait for all outstanding async work that uses group0 to complete before
destroying group0 server.

Fixes scylladb/scylladb#20701
2024-10-06 17:20:52 +03:00
Gleb Natapov
ba22493a69 group0: Stop group0 if node initialization fails
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So lets abort on both
paths (but only once).
2024-10-06 17:20:52 +03:00
Kefu Chai
960aa38cf3 utils/i_filter: include used header
when compiling with clang-19 and the standard library from GCC-14.2,
we have:

```
/usr/bin/cmake -E __run_co_compile --tidy="clang-tidy;--checks=-*,bugprone-use-after-move;--extra-arg-before=--driver-mode=g++" --source=/__w/scylladb/scylladb/utils/bloom_filter.cc -- /usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -I/__w/scylladb/scylladb -I/__w/scylladb/scylladb/seastar/include -I/__w/scylladb/scylladb/build/seastar/gen/include -I/__w/scylladb/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/__w/scylladb/scylladb/build=. -march=wes
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:81:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
   81 | filter_ptr create_filter(int hash, large_bitset&& bitset, filter_format format) {
      | ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:82:12: error: no viable conversion from returned value of type '__detail::__unique_ptr_t<murmur3_bloom_filter>' (aka 'unique_ptr<utils::filter::murmur3_bloom_filter>') to function return type 'int' [clang-diagnostic-error]
   82 |     return std::make_unique<murmur3_bloom_filter>(hash, std::move(bitset), format);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:85:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
   85 | filter_ptr create_filter(int hash, int64_t num_elements, int buckets_per, filter_format format) {
      | ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.cc:86:12: error: no viable conversion from returned value of type '__detail::__unique_ptr_t<murmur3_bloom_filter>' (aka 'unique_ptr<utils::filter::murmur3_bloom_filter>') to function return type 'int' [clang-diagnostic-error]
   86 |     return std::make_unique<murmur3_bloom_filter>(hash, large_bitset(get_bitset_size(num_elements, buckets_per)), format);
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: /__w/scylladb/scylladb/utils/bloom_filter.hh:93:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
   93 | filter_ptr create_filter(int hash, large_bitset&& bitset, filter_format format);
      | ^
Error: /__w/scylladb/scylladb/utils/bloom_filter.hh:94:1: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
   94 | filter_ptr create_filter(int hash, int64_t num_elements, int buckets_per, filter_format format);
      | ^
Error: /__w/scylladb/scylladb/utils/i_filter.hh:17:25: error: no template named 'unique_ptr' in namespace 'std' [clang-diagnostic-error]
   17 | using filter_ptr = std::unique_ptr<i_filter>;
      |                    ~~~~~^
Error: /__w/scylladb/scylladb/utils/i_filter.hh:54:12: error: unknown type name 'filter_ptr' [clang-diagnostic-error]
   54 |     static filter_ptr get_filter(int64_t num_elements, double max_false_pos_prob, filter_format format);
      |            ^
4 warnings and 8 errors generated.
```

apparently, the definition of `std::unique_ptr` is missing where it is
used. so let's include `<memory>`, so that `i_filter.hh` is more
self-contained.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20971
2024-10-06 14:20:41 +03:00
Michał Chojnowski
5884c9d2fc utils/rjson.cc: correct a comment about assert()
Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.

But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.

Closes scylladb/scylladb#20967
2024-10-06 12:47:51 +03:00
Michał Chojnowski
882a3c60e4 utils/cached_file: reduce latency (and increase overhead) of partially-cached reads
Currently, `cached_file::stream` (currently used only by index_reader,
to read index pages), works as follows.

Assume that the caller requested a read of the range [pos, pos + size).
Then:

- If the first page of the requested range is uncached,
  the entire [pos, pos + size) range is read from disk (even if some
  later pieces of it are cached), the resulting pages are added to the cache,
  and the read completes (most likely) from the cached pages.
- If the first page of the read is cached, then the rest of the read
  is handled page-by-page, in a sequential loop, serving each page
  either from cache (if present) or from disk.

For example, assume that pages 0, 1, 2, 3, 4 are requested.

If exactly pages 1, 2 are cached, then `stream` will read the entire [0, 4] range
from disk and insert the missing 0, 3, 4, and then it will continue serving the
read from cache.

If exactly pages 0 and 3 are cached, then it will serve 0 from cache,
then it will read 1 from disk and insert it into cache,
then it will read 2 from disk and insert it into cache,
then it will serve 3 from cache,
then it will read 4 from disk and insert it into cache.

If exactly the first page is cached, a 128 kiB read turns
into 31 I/O sequential read ops.

This is weird, and doesn't look intended. In one case, we are reading even pages
we already have, just to avoid fragmenting the read, and in the other case
we are reading pages one-by-one (sequentially!) even if they are neighbours.

I'm not sure if cached_file should minimize IOPS or byte throughput,
but the current state is surely suboptimal. Even if its read strategy
is somehow optimal, it should still at least coalesce contiguous reads
and perform the non-contiguous reads in parallel.

This patch leans into minimizing IOPS. After the patch, we serve
as many front pages from the cache as we can, but when we see
an uncached page, we read the entire remainder of the read from disk.

As if we trimmed the read request by the longest cached prefix,
and then performed the rest using the logic from before the patch.

For example, if exactly pages 0 and 3 are cached,
then we serve 0 from cache,
then we read [1, 4] from disk and insert everything into cache.

For partially-cached files, this will result in more bytes read
from disk, but less IOPS. This might be a bad thing. But if so,
then we should lean the other way in a more explicit and efficient
way than we currently do.

Closes scylladb/scylladb#20935
2024-10-04 17:39:38 +02:00
Emil Maskovsky
a11ede758e clang-format: argument and function packing
Changes to better match the Seastar code style and the current codebase.

Allow parameter binpacking and continuation indenting.

Refs: scylladb/scylladb#20951
2024-10-04 14:52:41 +02:00
Emil Maskovsky
b4f28b3e0e clang-format: don't sort the include headers
Sorting the include headers causes reordering of all headers and thus
large diffs, especially in the files that include a lot of headers that
have not been sorted before. This makes it harder to review the changes
and to understand the history of the file.

The Seastar code style doesn't prescribe any include headers ordering.

Refs: scylladb/scylladb#20951
2024-10-04 14:51:54 +02:00
Kefu Chai
d72c8fc047 .github: add db to iwyu's CLEANER_DIR
to avoid future violations of include-what-you-use.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-04 20:48:18 +08:00
Kefu Chai
ee36358a60 db: remove unused includes
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.

please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.

so, in this change, all the unused includes are removed. but there are
some headers which are actually used, while still being identified by
this tool. these includes are marked with "IWYU pragma: keep".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-04 20:48:18 +08:00
Botond Dénes
af124993a4 Merge 'Do not remove objects from backup storage after restore' from Pavel Emelyanov
The restore-from-s3 task uses load-and-stream internally which, in turn, unlinks loaded sstables on success. That's not what user expects when it restores from backup, objects should remain in bucket afterwards.

Closes scylladb/scylladb#20947

* github.com:scylladb/scylladb:
  test: Add check that restored-from objects are not removed
  sstables_loader: Dont unlink sstables when restoring from S3
  sstables_loader: Make primary_replica_only bool_class RAII field
2024-10-04 14:59:40 +03:00
Nikita Kurashkin
874cafefab SStables: replace assertion with malformed_sstable_exception for invalid chunk_size
This will allow to see underlying sstable file
Fixes #20277

Closes scylladb/scylladb#20784
2024-10-04 14:48:35 +03:00
Pavel Emelyanov
6b480589fe Merge 'treewide: accept list of sstables in "restore" API ' from Kefu Chai
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.

in order to fill the gap, in this change:

* add the "prefix" parameter for specifying the shared prefix of
  sstables
* add the "sstables" parameter for specifying the list of  TOC
  components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
  on scylla's end anymore.
* make the "table" parameter mandatory.

Fixes https://github.com/scylladb/scylladb/issues/20461

----

this change is a part of the efforts to bring the native backup/restore to scylla, no need to backprt.

Closes scylladb/scylladb#20685

* github.com:scylladb/scylladb:
  treewide: accept list of sstables in "restore" API
  sstable: pass get_storage_option to sstable_directory::load_sstable()
  test/nodetool: add body parameter to `expected_request`
  tools/scylla-nodetool: enable nodetool to write HTTP body
2024-10-04 12:38:08 +03:00
Pavel Emelyanov
0f6e76f92f api: Use captured compaction_manager in get_cm_stats() helper
This is continuation of the previous patch that also need to touch the
helper function argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 12:28:15 +03:00
Pavel Emelyanov
f99d8e07ae api: Use captured compaction_manager in endpoints
Instead of getting via ctx -> database chain.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 12:28:15 +03:00
Pavel Emelyanov
05b4a8e710 api: Add sharded<compaction_manager> argument to compaction_manager API reg/unreg
To be used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 12:28:15 +03:00
Pavel Emelyanov
58c4c21581 api: Move some endpoints from storage_service.cc to compaction_manager.cc
Those setting and getting bandiwdth need compaction manager to work with
and thus should sit next to other enpoints working with it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 11:36:07 +03:00
Pavel Emelyanov
43fe482204 api: Unset compaction_manager endpoints
Similarly to other .cc files, compaction manager should have its
endpoints unset. For now, no batch unsetting exists, so need to do it
one-by-one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 11:35:19 +03:00
Pavel Emelyanov
aa13be15b0 api: Use shorter registration method for compaction_manager function
The register_api() helper does exatly what's needed here -- registers
function and calls a method to set routes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-04 11:34:33 +03:00
Botond Dénes
07094c3e44 Merge 'replica: Fix tombstone GC during tablet split preparation' from Raphael "Raph" Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

Closes scylladb/scylladb#20939

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-10-04 10:29:42 +03:00
Nikos Dragazis
347f5ee166 sstables: Check if digest component exists
Extend `read_digest()` to first check if the digest component exists
before attempting to load it from disk.

Make `validate_checksums()` throw an error if the component does not
exist to preserve its current behavior.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-03 18:09:05 +03:00
Nikos Dragazis
7e738bcd2d sstables: Add digest in the SSTable components
SSTables store their digest in a Digest file. Add this in the list of
SSTable components. In a follow-up patch we will use this component to
enable digest checking in the validation path.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-03 18:09:05 +03:00
Nikos Dragazis
c893f06409 sstables: Add digest check in compressed data source
Following the addition of digest check in the checksummed data source,
add the same feature to the compressed data source as well. This ensures
consistent behavior across any type of SSTable.

This is added as an optional feature so that we can preserve the current
behavior, that is verify only the per-chunk checksums during normal user
reads. To ensure zero cost at runtime when disabled, we introduce the
on/off switch as a template parameter.

The digest calculation for compressed SSTables depends on the SSTable
format, hence the new template argument for the checksum mode. This is
consistent with the compressed data sink.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-03 18:09:01 +03:00
Nikos Dragazis
0df1c01759 sstables: Add digest check in checksummed data source
The checksummed data source verifies the checksum of each chunk in the
data files of uncompressed SSTables. This is being leveraged by scrub
in validation mode.

Extend the data source to check the digest (full checksum) as well.
Unlike checksums, this is added as an optional feature so that SSTables
without a digest can still be validated in a per-chunk basis. To enable
this, the caller needs to set the template parameter `check_digest` to
true, and provide the expected digest.

The data source calculates the digest incrementally through multiple
get() calls and compares against the expected digest after reading the
whole file range. If there is a mismatch, it throws an exception.

Checking the digest requires reading the whole data file. If this cannot
be satisfied (e.g., due to partial read or skip()), the data source
fails immediately. If the user has successfully read the whole file
range, it can be safely assumed that the digest is valid.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-10-03 18:08:56 +03:00
Tomasz Grabiec
62f3d9e173 perf: perf_fast_forward: Add test case for querying missing rows 2024-10-03 16:26:41 +02:00
Tomasz Grabiec
4602ba90df perf-fast-forward: Allow overriding promoted index block size
For testing dense clustering index.
2024-10-03 16:26:41 +02:00
Tomasz Grabiec
1782456a52 perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows 2024-10-03 16:26:41 +02:00
Tomasz Grabiec
751fa10de8 perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows 2024-10-03 16:26:41 +02:00
Tomasz Grabiec
10c6990e41 perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
It's a more realistic scenario than a full scan.
2024-10-03 16:26:28 +02:00
Tomasz Grabiec
753f6a61fd sstables: bsearch_clustered_cursor: Add more tracing points 2024-10-03 16:24:18 +02:00
Botond Dénes
38088daa1f scylla-gdb.py: drop compatibility code for EOL releases
Any release < 6.0 or < 2023.1 is EOL and need not be supported by
scylla-gdb.py anymore. Remove compatibility code for these releases.

Closes scylladb/scylladb#20918
2024-10-03 15:42:08 +03:00
Avi Kivity
494561c4f3 cql3: expr: drop boost usage
Replace boost usage with <ranges>, modernizing the code a little
and reducing dependencies on a redundant library.

Closes scylladb/scylladb#20919
2024-10-03 15:39:40 +03:00
Kefu Chai
7b82f3a375 test/lib: remove redundant fmt::to_string() in seastar::format()
previously change, implementation was unnecessarily verbose and less
efficient, as it created and immediately discarded temporary strings.
remove unnecessary use of `fmt::to_string()` when arguments are already
being formatted by `seastar::format()`.

in this this change:

- eliminates creation of temporary `std::string` instances
- reduces memory allocations and copies
- improves performance
- simplifies the code

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20923
2024-10-03 15:36:55 +03:00
Tomasz Grabiec
95b864497a sstables: reader: Log data file range 2024-10-03 14:16:05 +02:00
Tomasz Grabiec
41d3ae5e81 sstables: bsearch_clustered_cursor: Unify skip_info logging
Now all exit paths which return skip_info will print it in the same
way which makes for easier log parsing.
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
1b82d5117a sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
This is optimization.

Example:

  block0: start=aaa, end=aaA
  block1: start=bbb, end=bbB
  block2: whatever

Before the patch, advance_to("aAA") would skip to block0, and upper
bound probe would skip to block1. This way, the reader would read the
range of block0 from the data file.

After the patch, "end" position is taken into account, so
advance_to("aAA") will notice that block0 doesn't contain the position
and will skip to block1. This is especially important for dense
indexes, as it allows us to skip accessing data file if the search key
is missing.

It also solves the edge case problem related to the fact that single
row reads are using a range which with positions which are not equal
to the key, but are before(key) and after(key) for the lower bound and
upper bound respectively. Before the patch, advance_to(before("bbb"))
would skip to block0, before the position is before the block1's
start. And upper bound probe for after("bbb") would point to
block2. This way the read would scan block0 needlessly. After the
patch, advance_to(before("bbb")) will skip to block1 because we notice
based on "end" that block0 doesn't contain the position.

This change also ensures that the start position of the upper bound
entry of the after_key(pos), where pos is the last advance_to()
position, is warm in cache. This is needed to optimize single-row
reads with a dense index so that they always read exactly one promoted
index block. For this to work, probe_upper_bound() for the
after_key(row) always needs to find the upper bound block in
cache.
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
b03f23a09b sstables: bsearch_clustered_cursor: Skip even to the first block
It was unnecessary to emit a skip info for the first block since it
follows immediately the partition start, but it is relevant to the
optimization of avoiding data reads for missing keys. This
optimization relies on the fact that lower bound position equals upper
bound position. If the reader's key is before the first key in the
partition and we don't arm the skip info for the first block, lower
bound would be equal to the partition start, and upper bound would be
equal to the first row's position, which are not equal.
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
c905554121 test: sstables: sstable_3_x_test: Improve failure message 2024-10-03 14:16:05 +02:00
Tomasz Grabiec
7f077893ed sstables: mx: writer: Never include partition_end marker in promoted index block width
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).

This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that was recognized, it's wrong - if this is
not a single partition read (so upper bound is not at the next
partition too), then we would read from the wrong (next) partition.

We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.

Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
2024-10-03 14:09:57 +02:00
Pavel Emelyanov
4465bd9e5e test: Add check that restored-from objects are not removed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 13:37:04 +03:00
Botond Dénes
3ebb124eb2 repair/row_level: remove reader timeout
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.

The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.

Refs: #18269

Closes scylladb/scylladb#20703
2024-10-03 11:26:29 +02:00
Kamil Braun
e67016540c Merge 'Node replace and remove operations: Add deprecate IP addresses usage warning.' from Sergey Zolotukhin
- As part of deprecation of IP address usage, warning messages were added when IP addresses specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, ` host_id_or_endpoint lists` and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.

Fixes scylladb/scylladb#19218

Backport: 6.2 as it's not yet released.

Closes scylladb/scylladb#20756

* github.com:scylladb/scylladb:
  config: Add a warning about use of IP address for join topology and replace operations.
  nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
  tests: Fix incorrect UUIDs in test_nodeops
  utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
2024-10-03 11:08:28 +02:00
Kamil Braun
d2233d4400 Merge 'test: update cql/ tests to work with tablets enabled by default' from Konstantin Osipov
Explicitly disable tablets for features which still dont' work with tablets: cdc, lwt, coutners.

Closes scylladb/scylladb#20858

* github.com:scylladb/scylladb:
  test: make cdc tests pass with tablets on by default
  test: make cql/counters* pass with and without tablets
  test: make cql/lwt_* pass with and without tablets
  test: rename cql/list_test to cql/lwt_list_test
2024-10-03 10:53:17 +02:00
Kefu Chai
f9091066b7 treewide: replace boost::irange with std::views::iota where possible
when building scylla with the standard library from GCC-14.2, shipped by
fedora 41, we have following build failure:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/init.cc.o -MF CMakeFiles/scylla-main.dir/Debug/init.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/init.cc.o -c /home/kefu/dev/scylladb/init.cc
In file included from /home/kefu/dev/scylladb/init.cc:12:
In file included from /home/kefu/dev/scylladb/db/config.hh:20:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
3 errors generated.
[16/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/keys.cc.o
[17/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/counters.cc.o
[18/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/partition_slice_builder.cc.o
[19/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
FAILED: CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -MF CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -c /home/kefu/dev/scylladb/mutation_query.cc
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:11:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:37:
In file included from /home/kefu/dev/scylladb/db/snapshot-ctl.hh:20:
/home/kefu/dev/scylladb/tasks/task_manager.hh:403:54: error: no member named 'irange' in namespace 'boost'
  403 |         co_await coroutine::parallel_for_each(boost::irange(0u, smp::count), [&tm, id, &res, &func] (unsigned shard) -> future<> {
      |                                               ~~~~~~~^
4 errors generated.
```

so let's take the opportunity to switch from `boost::irange` to
`std::views::iota`.

in this change, we:

- switch from boost::irange to std::views::iota for better standard library compatibility
- retain boost::irange where step parameter is used, as std::views::iota doesn't support it
- this change partially modernizes our range usage while maintaining
- existing functionality

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20924
2024-10-03 10:33:33 +03:00
Pavel Emelyanov
7389f4275d sstables_loader: Dont unlink sstables when restoring from S3
When load_and_stream() completes, all sstables that were loaded (and
streamed) are unlinked. This is wrong for the restore-from-s3 task, as
removing objects from backup storage is not what user expects.

Fix it by adding a boolean to streamer class, and set it to false (well,
bool_class<>::no) for restore task.

fixes: #20938

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:15:22 +03:00
Pavel Emelyanov
7eb48358e9 sstables_loader: Make primary_replica_only bool_class RAII field
This boolean is currently passed all the way around as pure bool
argument. And it's only needed in a single get_endpoints() method that
calculates the target endpoints.

This patch places this bool on class streamer, so that the call chain
arguments are not polluted, and converts it to bool_class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:13:37 +03:00
Pavel Emelyanov
1da1d131b2 test: Use corrcet key in sstables registry mock
The "real" registry defines its primary key as (location, generation)
pair, where location is the partition key and generation is clustering
key. The registry mock uses only location part as primary key, while it
must use both.

The buggy mock works simply because the listing API is in fact not used
by unit tests. Those tests that do need it are python tests that start
scylla and thus implicitly use real registry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:07:11 +03:00
Pavel Emelyanov
a503e2ab10 test: Pass entry status to mock registry consumer
When sstables registry is listed, the passed consumer accepts entry
status as its first argument, not its location (location is passed as a
search key)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-03 10:06:28 +03:00
Kefu Chai
c7eafc4dc1 auth: capture boost::regex_error not std::regex_error
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.

since we decided to use Boost.Regex, let's catch `boost::regex_error`.

Refs a3db5401
Fixes scylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20942
2024-10-03 09:57:15 +03:00
Piotr Dulikowski
6778001313 Merge 'cql3: Make creating MV respect ID option' from Dawid Mędrek
Before these changes, we could create a materialized
view specifying its ID, but the option was ignored.

This commit makes Scylla respect the option. Now specifying
the ID results in the MV being created with that specific ID.
This way, Scylla's behavior is consistent with Cassandra's.

Because Cassandra doesn't mention the option in its
user documentation, we don't update it either in case
the semantics of it changes in the future -- we want
to have an open door for any modifications.

Note that Cassandra returns a server error if the provided
ID is already in use, both in the case of regular tables
and MVs. That's most likely a bug. Instead of following that
behavior, we stay consistent with the current semantics of
creating a regular table in Scylla: if the provided ID is
already used, return an InvalidRequest.

The last thing worth pointing out is Cassandra handles
`WITH ID = null` as a special case; normally, specifying
an invalid ID results in a ConfigurationException, but a null
is treated as a syntax error. As in the previous paragraph,
we stay consistent with the semantics of regular tables and
all invalid IDs, null included, lead to a ConfigurationException.

We also add a few short tests verifying that the implementation
works as intended.

Fixes scylladb/scylladb#20616

Backport not needed: the semantics of the option was never documented in either Cassandra, or Scylla.

Closes scylladb/scylladb#20773

* github.com:scylladb/scylladb:
  test/cql-pytest: Get rid of unnecessary processing describe statements
  cql3: Make creating MV respect ID option
2024-10-03 08:31:07 +02:00
Dawid Mędrek
1f1b201fd8 cql3/functions/user_function: Use fmt to format create statement
We replace `std::ostringstream` with views and formatting
using fmt to improve readability of the code.
2024-10-02 19:17:35 +02:00
Ferenc Szili
cdf775d3cc test: test tombstone GC disabled on pending replica
This tests if tombstone GC is disabled on pending replicas
2024-10-02 16:37:57 +02:00
Ferenc Szili
ba6707506d tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
In order to avoid cases during tablet migrations where we garbage
collect tombstones before the data it shadows arrives, we will
disable tombstone GC on pending replicas.

To achieve this we added a tombston_gc_enabled flag to compaction_group.
This flag is updated from updte_effective_repliction_map method of the
tablet_storage_group_manager class.
2024-10-02 16:31:33 +02:00
Raphael S. Carvalho
93815e0649 replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-10-02 11:26:13 -03:00
Ferenc Szili
e472844a78 database::table: add tombstone_gc_enabled(locator::tablet_id)
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
2024-10-02 16:24:45 +02:00
Raphael S. Carvalho
bcd358595f service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.

Fixes #20890.
2024-10-02 11:23:44 -03:00
Konstantin Osipov
1d1777b13a test: make cdc tests pass with tablets on by default
CDC is not supported with tablets, explicitly disable tablets
in CDC keyspace definition.
2024-10-02 06:37:14 -04:00
Konstantin Osipov
0e3dbec277 test: make cql/counters* pass with and without tablets
Counters are not supported with tablets, make sure
the test works in any ScyllaDB configuration.
2024-10-02 06:37:14 -04:00
Konstantin Osipov
92aca17bc5 test: make cql/lwt_* pass with and without tablets
Lightweight transactions don't support tablets, so let's
explicitly disable tablets in LWT tests.
2024-10-02 06:37:14 -04:00
Konstantin Osipov
7c64fc0c4f test: rename cql/list_test to cql/lwt_list_test
This test is actually testing lists with LWT, so should
have the corresponding name. Going forward we'll patch CQL LWT
tests for tablets, so let's group them together.
2024-10-02 06:37:14 -04:00
Sergey Zolotukhin
6398b7548c config: Add a warning about use of IP address for join topology and replace
operations.

When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.

Fixes scylladb/scylladb#19218
2024-10-02 11:56:59 +02:00
Sergey Zolotukhin
9c692438e9 nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
Since we are deprecating the use of IP addresses, a warning message will be printed
if 'nodetool removenode --ignore-dead-nodes' is used with IP addresses.
2024-10-02 11:56:59 +02:00
Sergey Zolotukhin
a871321ecf tests: Fix incorrect UUIDs in test_nodeops
It was found that the UUIDs used in test_nodeops were
invalid. This update replaces those UUIDs with newly generated
random UUIDs.
2024-10-02 11:56:59 +02:00
Sergey Zolotukhin
3b9033423d utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
- utils::split_comma_separated_list now accepts a reference to sstring instead
  of a copy to avoid extra memory allocations. Additionally, the results of
  trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
  parse_node_list and api/storage_service::set_storage_service were changed to use
  std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
  std::vector is a more cache-friendly structure,  resulting in better performance.
2024-10-02 11:56:59 +02:00
Dawid Mędrek
7a7a1e3558 treewide: Prefer bytes_fwd.hh over bytes.hh
CI started reporting warnings about including `bytes.hh` in
several files. The reason is they actually only use code
introduced in `bytes_fwd.hh` (which is also included by `bytes.hh`).
Clang-include-cleaner suggests that we get rid of that indirection
and only include `bytes_fwd.hh`. That's what happens in this commit.

We include `bytes.hh` in `exceptions/exceptions.cc` because
it relies on the formatting utilities declared and defined
in `bytes.hh`.

Closes scylladb/scylladb#20842
2024-10-02 07:29:30 +02:00
Dawid Mędrek
de88c150f6 test/cql-pytest: Get rid of unnecessary processing describe statements
As part of scylladb/scylladb@d42f160, we added a test verifying that
restoring the schema works as intended. Unfortunately, because of
scylladb/scylladb#20616, we had to manually process the results
of `DESCRIBE SCHEMA` to exclude the ID parameter and be able to
compare restore statements corresponding to the same view.

Now that materialized views respect the ID parameter, we can get rid
of that logic.
2024-10-01 22:04:05 +02:00
Dawid Mędrek
552c752005 cql3: Make creating MV respect ID option
Before these changes, we could create a materialized
view specifying its ID, but the option was ignored.

This commit makes Scylla respect the option. Now specifying
the ID results in the MV being created with that specific ID.
This way, Scylla's behavior is consistent with Cassandra's.

Because Cassandra doesn't mention the option in its
user documentation, we don't update it either in case
the semantics of it changes in the future -- we want
to have an open door for any modifications.

Note that Cassandra returns a server error if the provided
ID is already in use, both in the case of regular tables
and MVs. That's most likely a bug. Instead of following that
behavior, we stay consistent with the current semantics of
creating a regular table in Scylla: if the provided ID is
already used, return an InvalidRequest.

The last thing worth pointing out is Cassandra handles
`WITH ID = null` as a special case; normally, specifying
an invalid ID results in a ConfigurationException, but a null
is treated as a syntax error. As in the previous paragraph,
we stay consistent with the semantics of regular tables and
all invalid IDs, null included, lead to a ConfigurationException.

We also add a few short tests verifying that the implementation
works as intended.
2024-10-01 22:03:58 +02:00
Kefu Chai
9b5eab0dde test/lib: include <fmt/std.h> for formatting std::optional
before this change, when compiling with fmtlib v11.0.2 and clang
v19.1.0, the compiler fails like:

```
/usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DBOOST_UNIT_TEST_FRAMEWORK_DYN_LINK -DBOOST_UNIT_TEST_FRAMEWORK_NO_LIB -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT test/lib/CMakeFiles/test-lib.dir/Debug/cql_assertions.cc.o -MF test/lib/CMakeFiles/test-lib.dir/Debug/cql_assertions.cc.o.d -o test/lib/CMakeFiles/test-lib.dir/Debug/cql_assertions.cc.o -c /home/kefu/dev/scylladb/test/lib/cql_assertions.cc
In file included from /home/kefu/dev/scylladb/test/lib/cql_assertions.cc:12:
In file included from /usr/include/fmt/ranges.h:20:
In file included from /usr/include/fmt/format.h:41:
/usr/include/fmt/base.h:2673:45: error: implicit instantiation of undefined template 'fmt::detail::type_is_unformattable_for<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, char>'
 2673 |     type_is_unformattable_for<T, char_type> _;
      |                                             ^
/usr/include/fmt/base.h:2735:23: note: in instantiation of function template specialization 'fmt::detail::parse_format_specs<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, fmt::detail::compile_parse_context<char>>' requested here
 2735 |         parse_funcs_{&parse_format_specs<Args, parse_context_type>...} {}
      |                       ^
/usr/include/fmt/base.h:2884:47: note: in instantiation of member function 'fmt::detail::format_string_checker<char, int, std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, std::vector<std::optional<managed_bytes>>>::format_string_checker' requested here
 2884 |       detail::parse_format_string<true>(str_, checker(s));
      |                                               ^
/home/kefu/dev/scylladb/test/lib/cql_assertions.cc:132:34: note: in instantiation of function template specialization 'fmt::basic_format_string<char, int &, std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<managed_bytes>> &>::basic_format_string<char[35], 0>' requested here
  132 |             fail(seastar::format("row {} differs, expected {} got {}", row_nr, row, actual));
      |                                  ^
/usr/include/fmt/base.h:1616:8: note: template is declared here
 1616 | struct type_is_unformattable_for;
      |        ^
/home/kefu/dev/scylladb/test/lib/cql_assertions.cc:132:34: error: call to consteval function 'fmt::basic_format_string<char, int &, std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<managed_bytes>> &>::basic_format_string<char[35], 0>' is not a constant expression
  132 |             fail(seastar::format("row {} differs, expected {} got {}", row_nr, row, actual));
      |                                  ^
```

because the formatter for `std::optional<>` is defined in fmt/std.h.

so, in this change, we include the used header.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20922
2024-10-01 22:32:16 +03:00
Tomasz Grabiec
a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, but it's not effective becasue it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00
Tomasz Grabiec
41be5d1daf sstables: clustered_cursor: Track current block
Will be needed by the reader to jump to the current block even if we
already advanced to it before, when setting up the reader context.

We want to advance to lower bound earlier, before the praser skips to
the lower bound. We want that in order to set input stream data file
range based on index. If we didn't have access to the current block
and used the result from advance_to(), the parser will think we're
already in the block which has lower_bound when it attempts to skip,
and will not skip, falling back to scanning.
2024-10-01 18:40:34 +02:00
Kefu Chai
787ea4b1d4 treewide: accept list of sstables in "restore" API
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.

in order to fill the gap, in this change:

* add the "prefix" parameter for specifying the shared prefix of
  sstables
* add the "sstables" parameter for specifying the list of  TOC
  components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
  on scylla's end anymore.
* make the "table" parameter mandatory.

Fixes scylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-01 23:24:56 +08:00
Kefu Chai
17181c2eca sstable: pass get_storage_option to sstable_directory::load_sstable()
before this change, we always pass
`sstable_directory::_storage_opts` to `_manager.make_sstable()` in
`sstable_directory::load_sstable()`. but when loading from object
storage, we need to customize the storage_options on a per-sstable
basis. the way to address this is to allow the caller of
`sstable_directory::process_descriptor()` to pass a functor which
return the `storage_options` to be used when creating the sstable.

so, in this change, we update

- sstable_directory::load_sstable()
- sstable_directory::process_descriptor()

so that they accept another parameter to create the storage_options.
in the next commit we will pass a different functor for customizing
the storage_options on a per-sstable basis when loading sstables.

Refs scylladb/scylladb#20461

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-01 23:24:56 +08:00
Kefu Chai
283697e316 test/nodetool: add body parameter to expected_request
before this change, `expected_request` only includes query strings
for the parameters of requests. but we will add an API
("storage_service/restore") which accepts its parameters in HTTP body
as well.

in this change, we add an optional `body` member to `expected_request`,
so that we can mock the APIs which pass the parameters with the HTTP
body.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-01 23:24:56 +08:00
Kefu Chai
3c19cc9aec tools/scylla-nodetool: enable nodetool to write HTTP body
before this change, we always send the parameters with query strings,
but we will add an API ("storage_service/restore") which accepts its
parameters in HTTP body as well.

in this change, we add an optional parameter to `do_request()` and
`post()`, so that we can send HTTP body when using "POST" method
in nodetool implementation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-01 23:24:56 +08:00
Pavel Emelyanov
7f71371de1 distributed_loader: Get token metadata from e.r.m., not database
Though database can be used to get relevant token metadata, it's better
not to use one service (database) as a proxy to get another one (token
metadata). In case of tokens, there's effective replication map at hand,
which is a more correct source of such topology information.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20894
2024-10-01 14:59:35 +03:00
Anna Stuchlik
7eb1dc2ae5 doc: document the option to run ScyllaDB in Docker on macOS
This commit adds a description of a workaround to create a multi-node ScyllaDB cluster
with Docker on macOS.

Refs https://github.com/scylladb/scylladb/issues/16806
See https://forum.scylladb.com/t/running-3-node-scylladb-in-docker/1057/4

Closes scylladb/scylladb#20857
2024-10-01 14:58:58 +03:00
Botond Dénes
6535283881 .github/CODEOWNERS: add code owners for tools/*
Closes scylladb/scylladb#20702
2024-10-01 14:52:26 +03:00
Yaron Kaikov
ab964bcd5a [script/pull_github_pr.sh] Check Gating status before merging
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers

Related to scylladb/scylla-pkg#3644

Closes scylladb/scylladb#20742
2024-10-01 14:46:29 +03:00
Anna Stuchlik
a97db03448 doc: add metric updates from 6.1 to 6.2
This commit specifies metrics that are new in version 6.2 compared to 6.1,
as specified in https://github.com/scylladb/scylladb/issues/20176.

Fixes https://github.com/scylladb/scylladb/issues/20176

Closes scylladb/scylladb#20896
2024-10-01 14:41:37 +03:00
muthu90tech
1204d54c5c transport: Dont bypass seastar API when making syscalls
The transport/controller.cc bypasses seastar API when making a few syscalls,
this PR will use the right seastar API to make the syscall and libc calls
this PR relies on few new APIs introduced in
seastar commit : cd7f3b8e8850cd80a4f6899cedc726e576c51abe

Closes scylladb/scylladb#17443

Closes scylladb/scylladb#19565
2024-10-01 14:29:24 +03:00
Benny Halevy
5a0f3889e0 treewide: use std::ranges sort functions rather than boost
Using the standard library is preffered over boost.

In cql3/expr/expression.cc to_sorted_vector got more of a
face-list and was modernized to use also std::unique
and while at it, to move its input range in the uniquely sorted
result vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-01 14:19:05 +03:00
Avi Kivity
e99426df60 treewide: de-static namespace scope functions in headers
'static inline' is always wrong in headers - if the same header is
included multiple times, and the function happens not to be inlined,
then multiple copies of it will be generated.

Fix by mechanically changing '^static inline' to 'inline'.
2024-10-01 14:02:50 +03:00
Avi Kivity
e9425e15b2 treewide: remove dependency on boost asio address_v4
It's not used. There's a comment mentioning it prevents some type
conflict, but apparently that was fixed some time ago.

Closes scylladb/scylladb#20883
2024-10-01 14:00:50 +03:00
Pavel Emelyanov
24598848a9 Merge 'virtual_tables: snapshots: include all snapshots' from Benny Halevy
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.

Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.

Fixes #18313

* No backport required at this time since this is rather minor UX issue that weren't hit in the field AFAIK

Closes scylladb/scylladb#20869

* github.com:scylladb/scylladb:
  cql-pytest: test_virtual_tables: add test_snapshots_multiple_keyspaces
  virtual_tables: snapshots: include all snapshots
2024-10-01 13:56:13 +03:00
Gleb Natapov' via ScyllaDB development
22368b13f2 api: introduce raft stepdown REST API
Also provide test.py util function to trigger it. Can be useful for
testing.
2024-10-01 12:18:49 +02:00
Avi Kivity
f5628be597 Update tools/java submodule
* tools/java 5b0e274f12...b2d025fd6b (1):
  > build.xml: update scylla-tools license
2024-10-01 12:48:45 +03:00
Pavel Emelyanov
1dfe780457 cql: Check that CREATEing tablets/vnodes is consistent with the CLI
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:

    if (strategy.supports_tablets) {
         if (cql.with_tablets) {
             if (cfg.enable_tablets) {
                 return create_keyspace_with_tablets();
             } else {
                 throw "tablets are not enabled";
             }
         } else if (cql.with_tablets = off) {
              return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
              if (cfg.enable_tablets) {
                  return create_keyspace_with_tablets();
              } else {
                  return create_keyspace_without_tablets();
              }
         }
     } else { // strategy doesn't support tablets
         if (cql.with_tablets == on) {
             throw "invalid cql parameter";
         } else if (cql.with_tablets == off) {
             return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
             return create_keyspace_without_tablets();
         }
     }

closes: #20088

In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.

There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20779
2024-10-01 10:54:29 +02:00
Botond Dénes
e780a3f168 Merge 'fix regressions of building tests with cmake' from Laszlo Ersek
Fix two recent regressions of the cmake build -- found this time in the test suite.

We (presumably) don't build stable releases (and their tests) with CMake, so backporting these fixes appears unnecessary, even if the regressions have been ported to stable branches.

@xemul @dawmd @tchaikov @tgrabiec @scylladb/scylla-maint

Closes scylladb/scylladb#20854

* github.com:scylladb/scylladb:
  test/boost/bptree_test: fix the CMake build
  test/boost/auth_test: fix the CMake build
2024-10-01 11:14:19 +03:00
Kefu Chai
d484121cc8 github: add a trigger to retrigger clang-tidy with comment
before this change, clang-tidy is triggered by a pull request. but
there are chances that user wants to retrigger it. for jenkins
jobs, user can rebuild a job manually. but for workflow, only the
developers with write permission can retrigger a workflow. this
is not convenient to regular contributors.

so, in this change, another trigger is added, so that user can
trigger the clang-tidy workflow with "/clang-tidy" command.
the syntax is inspired by IRC commands.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20841
2024-10-01 11:10:54 +03:00
Ernest Zaslavsky
5a96549c86 test: add complete_multipart_upload completion tests
A primitive python http server is processing s3 client requests and issues either success or error. A multipart uploader should fail or succeed (with or without retries) depending on aforementioned server response
2024-10-01 09:06:24 +03:00
Ernest Zaslavsky
3be6052786 code: s3 client error handling
Handle the `finalize_upload` possible exception to abort the upload (which also can throw) and show the right error originated from the `finalize_upload`
2024-10-01 09:06:24 +03:00
Ernest Zaslavsky
6be2433b5a code: add response parsing and error handling to the complete_multipart_upload
Instead of ignoring the response for multipart upload completion start parsing it and look for a possible errors in the response body. If the error is found throw an exception
2024-10-01 09:06:24 +03:00
Ernest Zaslavsky
826cf5cd4a code: Introduce AWS errors parsing
Add a simple utility class to parse (possible) error response from AWS S3. Stay as close as possible to aws-sdk-cpp ErrorMarshaler https://github.com/aws/aws-sdk-cpp/blob/main/src/aws-cpp-sdk-core/source/client/AWSErrorMarshaller.cpp logic
Also, add a tester for this new class
2024-10-01 09:06:24 +03:00
Michał Chojnowski
c77d00fd8d index_reader: remove a piece of misguided code involved in single-partition reads
This patch removes a piece of code which, according to the comment,
allows for forwarding the index reader even if it was created as a
single-partition reader.

For single-partition reads, the input_stream
used by the reader is limited to the single index page containing the partition,
since reading the index file past that point would be a waste.
Because of this limit, such an index reader can't be forwarded/advanced.

The dubious piece of code gets around that by unsetting the stream and ensuring
it will be re-created, this time without the limit, if the index is advanced.

But there is no use for this. The idea of a "single-partition reader"
exist as an optimization. It's illegal to forward single-partition readers,
and it doesn't make sense to attempt that. (If there's a need for forwarding,
just don't create a single-partition reader).

I suspect this piece of code was written due to a misunderstanding.
Before the previous patch in this series, when the searched partition key
was the first key in its page, the index reader would scan the preceding
page first, realize it made a mistake, and advance to the next, correct page.
I suspect this piece of code was written to make this work.

But this is, in fact, undesirable. The fact that the index reader was working
like this was a performance bug. In the single-partition case there's
never an inherent reason to start with the wrong page. The index logic can be
corrected to always start with the right page, and that's what the previous
patch in this series does. And with that, there is no need to support advancing
anymore, and the dubious piece of code can be erased.

We also add an assert to emphasize that advancing a single-partition reader is
illegal.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
bc30509523 index_reader: in single-partition reads, don't read more than one page
When looking for a partition key in the index, we scan the index from the
first index page which can possibly contain the key.

In a single-partition read, there is never a reason to read beyond that
page. After the previous patch in this series, it's guaranteed that
the first key in the next page is strictly greater than the searched key.
So if the searched key is greater than the last key in the first page,
then it is neither in the first nor the second page -- it must be absent
from the sstable.

But with the current logic, we read the second index page anyway, and
the realization that the key is absent happens higher in the call chain.

This patch optimizes that inefficiency by immediately returning EOF
if a single-partition read doesn't find the key in the first page.

Returning "end of file" even though we didn't actually go beyond the
end of file is hacky, but I don't see any other non-invasive way
of communicating to the caller that the partition is absent.

Some caller of the index could possibly assume that returning EOF
proves that the searched key is greater than all keys in the sstable.
I don't think any such caller exists today, but it's a possible
place for confusion.

Together with the previous patch in this series, this patch guarantees
that a single-partition read only accesses a single index page.
This fixes a weird secondary performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever.
After this and the previous patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
6b8b7d962c index_reader: fix unnecessary reads of preceding index pages
When setting the index to position X, we first look for the first summary
entry N such that N >= X. Then we load the index page preceding N and
scan it for the first partition key P such that P >= X.
If there is no such key in this page, then we scan the next page
(starting with N) for such key. (In this case it's always the first key).

For example, assume we have:
summary:  A   C   E
index:    A B C D E F

If we look up "B" in the index, then we first locate summary entry "C",
then we scan the index for B, starting from "A". This is all fine.

But when we look for "C" in the index, then we do the exactly the same
-- we scan the index for "C" starting from "A". This is wasteful, because
we can start scanning from "C".

To avoid this inefficiency, we should be looking for N > X, not N >= X.
This patch fixes that.

In addition, this fixes a second, weirder performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever. After this patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Avi Kivity
fb8743b2d6 Merge 'sstables: Fix use-after-free on page cache buffer when parsing promoted index entries across pages' from Tomasz Grabiec
This fixes a use-after-free bug when parsing clustering key across
pages.

Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).

Details of the first problem:

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.

A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

Details of the solution:

We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.

We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.

Fixes #20766

Closes scylladb/scylladb#20837

* github.com:scylladb/scylladb:
  sstables: bsearch_clustered_cursor: Add trace-level logging
  sstables: bsearch_clustered_cursor: Move definitions out of line
  test, sstables: Verify parsing stability when allocating section is retried
  test, sstables: Verify parsing stability when buffers cross page boundary
  sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
  cached_file: Adapt page_view to ContiguousSharedBuffer
  cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
  sstables, utils: Allow parsers to work with different buffer types
  sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
  sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
  sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
2024-10-01 00:02:55 +03:00
Calle Wilund
b5d167699c commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.

Test included.

Closes scylladb/scylladb#20886
2024-09-30 18:04:00 +03:00
Raphael S. Carvalho
cf58674029 replica: Fix schema change during migration cleanup
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.

During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.

In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.

Fixes #20699.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20798
2024-09-30 17:30:38 +03:00
David Garcia
b94fbbf30c docs: update command
Removes the update command from the setup command.

This is required because versions now are not strictly pinned in the poetry.lock file since Sphinx ScyllaDB Theme 1.8.

Closes scylladb/scylladb#20876
2024-09-30 17:06:07 +03:00
Andrei Chekun
cdd0c0b7fc test.py: Do not attach logs for passed tests
To reduce the amount of space needed for reports, this PR will modify logs
attachment in allure, so it will attach logs only for the tests that have
status other than PASSED. To simplify the solution, with the current way it's
not possible to switch off these logs completely.

Closes scylladb/scylladb#20786
2024-09-30 14:55:55 +02:00
Kefu Chai
1c8100d3f1 test/unit: remove unused #include
following headers are no longer used by this compilation unit:

- "utils/managed_ref.hh"
- "test/perf/perf.hh"

this was identified by clang-include-cleaner. As the code is audited,
we can safely remove the #include directive.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20850
2024-09-30 14:46:39 +03:00
Kefu Chai
c3be4a36af test.py: pass "count" to re.sub() with kwarg
since Python 3.13, passing count to `re.sub()` as positional argument
has been deprecated. and when runnint `test.py` with Python 3.13, we
have following warning:

```
/home/kefu/dev/scylladb/./test.py:1477: DeprecationWarning: 'count' is passed as positional argument
  args.tests = set(re.sub(r'.* List configured unit tests\n(.*)\n', r'\1', out, 1, re.DOTALL).split("\n"))
```

see also https://github.com/python/cpython/issues/56166

in order to silence this distracting warning, let's pass
`count` using kwarg.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20859
2024-09-30 13:57:02 +03:00
Kefu Chai
947d9d5a97 scylla_coredump_setup: fix typos in comment
these typos were identified by the codespell workflow.

and fixed a syntax error along the way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20877
2024-09-30 13:29:34 +03:00
Aleksandra Martyniuk
efc7ad8547 node_ops: fix task_manager_module::get_nodes()
Currently, node ops virtual task gathers its children from all nodes contained
in a sum of service::topology::normal_nodes and service::topology::transition_nodes.
The maps may contain nodes that are down but weren't removed yet. So, if a user
requests the status of a node ops virtual task, the task's attempt to retrieve
its children list may fail with seastar::rpc::closed_error.

Filter out the tasks that are down in node_ops::task_manager_module::get_nodes.

Fixes: #20843.

Closes scylladb/scylladb#20856
2024-09-30 12:32:23 +03:00
Pavel Emelyanov
423b5a3ba7 Merge 'directories: cleanups to silence clang-tidy false alarms' from Kefu Chai
clang-tidy warns:

```
Warning: /__w/scylladb/scylladb/utils/directories.cc:132:52: warning: 'path' used after it was moved [bugprone-use-after-move]
  132 |         bool can_access = co_await file_accessible(path.string(), access_flags::read | access_flags::write | access_flags::execute);
      |                                                    ^
/__w/scylladb/scylladb/utils/directories.cc:121:28: note: move occurred here
  121 |         verification_error(std::move(path), "File not owned by current euid: {}. Owner is: {}", geteuid(), sd.uid);
      |                            ^
```

because we pass `std::move(path)` to `verification_error()`, and "then" use this variable again in this same function.
this is a false alarm, but we could make it very clear to convince this tool that it's safe to do so.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#20875

* github.com:scylladb/scylladb:
  directories: mark verification_error() with [[noreturn]]
  directories: pass const ref of path to verification_error()
2024-09-30 12:02:39 +03:00
Kamil Braun
322efb54c2 Merge 'raft_group0_client: place on a #include diet' from Avi Kivity
Reduce compile time and unnecessary compilations by reducing #include load.

Minor refactoring, no backport.

Closes scylladb/scylladb#20864

* github.com:scylladb/scylladb:
  raft_group0_client: uninclude "raft_group0_registry.hh"
  raft_group_registry: extract raft_timeout
  raft_group0_client: uninclude "mutation/mutation.hh"
  raft_group0_client: uninclude "db/system_keyspace.hh"
  db: system_keyspace: extract auth_version_t into its own header
2024-09-30 10:43:44 +02:00
Kefu Chai
faec71e666 directories: mark verification_error() with [[noreturn]]
this helps the compiler or static analyzers do make the right decision.
for instance, clang-tidy thinks a parameter like `std::move(path)`
could be reused after being moved away. with this attribute, this tool
should be able to tell that this never happens.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-30 12:07:15 +08:00
Kefu Chai
0ef72475fc directories: pass const ref of path to verification_error()
before this change, we pass a `path` to `verification_error()` by
moving away from the original `path`. this works fine in the sense
that it is correct and does not incur potential performance issues.

but clang-tidy considers it a used-after-move, because it cannot tell
`verification_error()` does not return at all, and believes that `path`
could be accessed again after being moved away. so it warns like:

```
Warning: /__w/scylladb/scylladb/utils/directories.cc:132:52: warning: 'path' used after it was moved [bugprone-use-after-move]
  132 |         bool can_access = co_await file_accessible(path.string(), access_flags::read | access_flags::write | access_flags::execute);
      |                                                    ^
/__w/scylladb/scylladb/utils/directories.cc:121:28: note: move occurred here
  121 |         verification_error(std::move(path), "File not owned by current euid: {}. Owner is: {}", geteuid(), sd.uid);
      |                            ^
```

in this change, instead of passing `fs::path` to `verification_error()`,
we pass a `const fs::path&` to this function. because
`verification_error()` is not coroutine, neither does it not pass `path` to
another continuation to be scheduled. so it's perfectly fine to pass
`path` to it.

this change address the false alarms from clang-tidy.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-30 12:07:15 +08:00
Nadav Har'El
64c0540d02 cql-pytest: test a few small materialized views syntax issue
While documenting materialized view in a new document (Refs #16569)
I encountered a few questions and this patch contains tests that
clarify their answer - and can later guarantee that the answer doesn't
unintentionally change in the future. The questions that these tests
answer are:

1. It is not allowed to filter a view on a static column (a comment
   on the test explains why).

2. We already tested that it's not allowed to SELECT a static column into
   a view. Here we add the check that "SELECT *" is also not allowed if
   a static column exists in the base table.

3. We check that CREATE MATERIALIZED VIEW ... WITH COMMENT='..' works.

4. We check that CREATE MATERIALIZED VIEW ... WITH COMPACT STORAGE is
   forbidden.

5. We check that CREATE MATERIALIZED VIEW ... WITH garbage=.. fails
   with a clean InvalidRequest.

All these tests pass on both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20873
2024-09-29 21:34:24 +03:00
Nadav Har'El
b008dabee5 test/cql-pytest: fix support for Cassandra 3
One of the design goals of the test/cql-pytest frameworks was to be able
to run these tests against Cassandra. Preferably, we should be able to
run most of the tests against any popular version of Cassandra, including
Cassandra 3. This is admittingly a very old version, but was still maintained
until just a year ago, it's the version that Scylla is most compatible with,
and we can still be curious about how it worked.

Until recently cql-pytest indeed worked on Cassandra 3, but it broke on some
change related to tablet detection that cause our most basic fixture -
"text_keyspace" - to use the Cassandra 4 feature of "auto expand".
This is trivial to fix - we should just use the this_dc fixture that we already
had exactly for this purpose.

Fixes #20781

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20782
2024-09-29 19:36:33 +03:00
Benny Halevy
946f21bbd3 cql-pytest: test_virtual_tables: add test_snapshots_multiple_keyspaces
Test snapshots listing in system.snapshots
using multiple keyspaces and multiple snpashots.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-29 14:36:18 +03:00
Benny Halevy
906de3444b virtual_tables: snapshots: include all snapshots
Use database::get_snapshot_details to get the details
of all snapshots on disk, in particular those of
deleted tables.

Add test_snapshots_dropped_table to test listing
of snapshots of a deleted table.
And harden the existing test cases to use a unique
snapshot tag and to delete it when the test ends.

Fixes scylladb/scylladb#18313

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-29 14:16:11 +03:00
Botond Dénes
a4c41755de Update seastar submodule
* ./seastar 69f88e2f...3c9c2696 (14):
  > core/reactor: don't check AIO block count when they are not needed
  > build: do not print the default value of --c++-standard in help output
  > json_formatter: Add tests for formatter::write
  > Add APIs to get group details and to change ownership of file.
  > scripts/perftune.py: improve a dry-run printout
  > build: drop the workaround for a GCC bug
  > cmake: Depend on libbsd if DPDK depends on it
  > http: clarify the ownership in the router's doxygen comment
  > build: check for P2582R1 support
  > python: introduce a python formatting CI check
  > addr2line: reformat with black
  > scripts: add pyproject.toml
  > json_formatter: Make formatter::write work for std::pair
  > README.md: use the github homepage of Ceph for Crimson

Closes scylladb/scylladb#20836
2024-09-29 13:47:40 +03:00
Avi Kivity
5a470b2bfb Merge 'scylla_raid_setup: configure SELinux file context' from Takuya ASADA
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #19325

Closes scylladb/scylladb#20528

* github.com:scylladb/scylladb:
  scylla_raid_setup: configure SELinux file context
  scylla_coredump_setup: fix SELinux configuration for RHEL9
2024-09-29 12:53:00 +03:00
Avi Kivity
884297ae2e raft_group0_client: uninclude "raft_group0_registry.hh"
Reduce unnecessary recompilations.
2024-09-28 17:25:11 +03:00
Avi Kivity
67cdd0d389 raft_group_registry: extract raft_timeout
It is a vocabulary term that shouldn't need the registry to be visible.
Extract it to a new header.
2024-09-28 17:25:03 +03:00
Avi Kivity
93afc77307 raft_group0_client: uninclude "mutation/mutation.hh"
Lighten the dependency load. Some constructors and destructors
are uninlined to avoid the header depending on the mutation class.
2024-09-28 16:31:53 +03:00
Avi Kivity
5d68efe0bd raft_group0_client: uninclude "db/system_keyspace.hh"
It doesn't need it apart from a forward declaration.

Files that lost necessary includes are adjusted, and some users
of auth_version_t are redirected to the definition outside system_keyspace.
2024-09-28 16:31:53 +03:00
Avi Kivity
df3ee94467 db: system_keyspace: extract auth_version_t into its own header
Users of auth_version_t shouldn't need to include the heavyweight
system_keyspace.hh.
2024-09-28 16:31:50 +03:00
Pavel Emelyanov
c17d353718 Revert "[script/pull_github_pr.sh] Check Gating status before merging"
This reverts commit fac682df7e.

Again, this patch broke maintainer workflows, it needs even more care.
2024-09-27 19:12:18 +03:00
Benny Halevy
23d6b996b8 test/pylib: scylla_cluster: set endpoint_snitch in scylla conf
When `property_file` is provided, we generate a
`cassandra-rackdc.properties` file, but to actually use it,
`endpoint_snitch` must be set to `GossipingPropertyFileSnitch`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20730
2024-09-27 16:46:54 +03:00
David Garcia
4900e4b1ac docs: update theme 1.8.1
chore: update README

Closes scylladb/scylladb#20832
2024-09-27 14:35:39 +02:00
Laszlo Ersek
153279dbfa test/boost/bptree_test: fix the CMake build
Commit 4cf4b7d4ef ("test: Move B+tree compactiont test from unit to
boost", 2024-09-24) introduced the first SEASTAR_THREAD_TEST_CASE to
"test/boost/bptree_test.cc" (alongside the prior BOOST_AUTO_TEST_CASEs),
but missed changing the KIND of the test from BOOST to SEASTAR. Therefore
we get a linker failure:

> : && /usr/bin/clang++ -O2 -Xlinker --build-id=sha1 --ld-path=ld.lld
> -dynamic-linker=/.../lib64/ld-linux-x86-64.so.2
> test/boost/CMakeFiles/bptree_test.dir/Dev/bptree_test.cc.o -o
> test/boost/Dev/bptree_test -L$srcdir/idl/absl::headers
> -Wl,-rpath,$srcdir/idl/absl::headers test/lib/Dev/libtest-lib.a
> seastar/Dev/libseastar.a /usr/lib64/libxxhash.so
> /usr/lib64/libboost_unit_test_framework.so.1.83.0  utils/Dev/libutils.a
> -Xlinker --push-state -Xlinker --whole-archive auth/Dev/libscylla_auth.a
> -Xlinker --pop-state  /usr/lib64/libcrypt.so cdc/Dev/libcdc.a
> compaction/Dev/libcompaction.a mutation_writer/Dev/libmutation_writer.a
> -Xlinker --push-state -Xlinker --whole-archive  dht/Dev/libscylla_dht.a
> -Xlinker --pop-state types/Dev/libtypes.a  index/Dev/libindex.a -Xlinker
> --push-state -Xlinker --whole-archive locator/Dev/libscylla_locator.a
> -Xlinker --pop-state message/Dev/libmessage.a  gms/Dev/libgms.a
> sstables/Dev/libsstables.a readers/Dev/libreaders.a
> schema/Dev/libschema.a  -Xlinker --push-state -Xlinker --whole-archive
> tracing/Dev/libscylla_tracing.a  -Xlinker --pop-state
> Dev/libscylla-main.a  -Xlinker --push-state -Xlinker --whole-archive
> Dev/libscylla-zstd.a  -Xlinker --pop-state /usr/lib64/libzstd.so
> abseil/absl/strings/Dev/libabsl_cord.a
> abseil/absl/strings/Dev/libabsl_cordz_info.a
> abseil/absl/strings/Dev/libabsl_cord_internal.a
> abseil/absl/strings/Dev/libabsl_cordz_functions.a
> abseil/absl/strings/Dev/libabsl_cordz_handle.a
> abseil/absl/crc/Dev/libabsl_crc_cord_state.a
> abseil/absl/crc/Dev/libabsl_crc32c.a
> abseil/absl/crc/Dev/libabsl_crc_internal.a
> abseil/absl/crc/Dev/libabsl_crc_cpu_detect.a
> abseil/absl/strings/Dev/libabsl_str_format_internal.a /usr/lib64/libz.so
> service/Dev/libservice.a  node_ops/Dev/libnode_ops.a
> service/Dev/libservice.a  node_ops/Dev/libnode_ops.a  -lsystemd
> raft/Dev/libraft.a  repair/Dev/librepair.a  streaming/Dev/libstreaming.a
> replica/Dev/libreplica.a  db/Dev/libdb.a  mutation/Dev/libmutation.a
> data_dictionary/Dev/libdata_dictionary.a  cql3/Dev/libcql3.a
> transport/Dev/libtransport.a  cql3/Dev/libcql3.a
> transport/Dev/libtransport.a  lang/Dev/liblang.a
> /usr/lib64/liblua-5.4.so  -lm  /usr/lib64/libsnappy.so.1.1.10
> abseil/absl/container/Dev/libabsl_raw_hash_set.a
> abseil/absl/hash/Dev/libabsl_hash.a  abseil/absl/hash/Dev/libabsl_city.a
> abseil/absl/types/Dev/libabsl_bad_variant_access.a
> abseil/absl/hash/Dev/libabsl_low_level_hash.a
> abseil/absl/types/Dev/libabsl_bad_optional_access.a
> abseil/absl/container/Dev/libabsl_hashtablez_sampler.a
> abseil/absl/profiling/Dev/libabsl_exponential_biased.a
> abseil/absl/synchronization/Dev/libabsl_synchronization.a
> abseil/absl/debugging/Dev/libabsl_stacktrace.a
> abseil/absl/synchronization/Dev/libabsl_graphcycles_internal.a
> abseil/absl/synchronization/Dev/libabsl_kernel_timeout_internal.a
> abseil/absl/debugging/Dev/libabsl_symbolize.a
> abseil/absl/debugging/Dev/libabsl_debugging_internal.a
> abseil/absl/base/Dev/libabsl_malloc_internal.a
> abseil/absl/debugging/Dev/libabsl_demangle_internal.a
> abseil/absl/time/Dev/libabsl_time.a
> abseil/absl/strings/Dev/libabsl_strings.a
> abseil/absl/strings/Dev/libabsl_strings_internal.a
> abseil/absl/strings/Dev/libabsl_string_view.a
> abseil/absl/base/Dev/libabsl_throw_delegate.a
> abseil/absl/numeric/Dev/libabsl_int128.a
> abseil/absl/base/Dev/libabsl_base.a
> abseil/absl/base/Dev/libabsl_raw_logging_internal.a
> abseil/absl/base/Dev/libabsl_log_severity.a
> abseil/absl/base/Dev/libabsl_spinlock_wait.a  -lrt
> abseil/absl/time/Dev/libabsl_civil_time.a
> abseil/absl/time/Dev/libabsl_time_zone.a rust/Dev/libwasmtime_bindings.a
> rust/librust_combined.a utils/Dev/libutils.a  seastar/Dev/libseastar.a
> /usr/lib64/libboost_program_options.so  /usr/lib64/libboost_thread.so
> /usr/lib64/libboost_chrono.so  /usr/lib64/libboost_atomic.so
> /usr/lib64/libcares.so  /usr/lib64/libfmt.so.10.2.1 /usr/lib64/liblz4.so
> /usr/lib64/libgnutls.so  -latomic /usr/lib64/libsctp.so
> /usr/lib64/libprotobuf.so /usr/lib64/libyaml-cpp.so
> /usr/lib64/libhwloc.so  /usr/lib64/libnuma.so /usr/lib64/libxxhash.so
> /usr/lib64/libcryptopp.so /usr/lib64/libdeflate.so
> /usr/lib64/libboost_regex.so.1.83.0 /usr/lib64/libicui18n.so
> /usr/lib64/libicuuc.so  -ldl && :
> ld.lld: error: undefined symbol: main
> >>> referenced by
> /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../lib64/crt1.o:(_start)
>
> ld.lld: error: undefined symbol:
> seastar::testing::seastar_test::seastar_test(char const*, char const*,
> int, boost::unit_test::decorator::collector_t&)
> ooo referenced by bptree_test.cc
> >>>
> test/boost/CMakeFiles/bptree_test.dir/Dev/bptree_test.cc.o:(_GLOBAL__sub_I_bptree_test.cc)
> clang++: error: linker command failed with exit code 1 (use -v to see invocation)

Fix the KIND now.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-09-27 12:21:17 +02:00
Laszlo Ersek
5fa87cb1c6 test/boost/auth_test: fix the CMake build
Commit 78ab1ee8b7 ("test: Add tests for `CREATE ROLE WITH SALTED HASH`",
2024-09-20) made test/boost/auth_test dependent on cql3, but didn't encode
the dependency in "CMakeLists.txt":

> FAILED:
> test/boost/CMakeFiles/auth_test.dir/RelWithDebInfo/auth_test.cc.o
> /usr/bin/clang++ -DBOOST_ALL_DYN_LINK -DFMT_SHARED
> -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7
> -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT
> -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING
> -DSEASTAR_TESTING_MAIN -DXXH_PRIVATE_API
> -DCMAKE_INTDIR=\"RelWithDebInfo\" -I$srcdir -I$srcdir/build/gen
> -I$srcdir/seastar/include -I$srcdir/build/seastar/gen/include
> -I$srcdir/build/seastar/gen/src -isystem $srcdir/abseil -isystem
> $srcdir/build/rust -ffunction-sections -fdata-sections -O3 -g -gz
> -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra
> -Wno-error=deprecated-declarations -Wimplicit-fallthrough
> -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags
> -Wno-missing-field-initializers -Wno-overloaded-virtual
> -Wno-unsupported-friend -Wno-enum-constexpr-conversion
> -Wno-unused-parameter -ffile-prefix-map=$srcdir/build=. -march=westmere
> -Xclang -fexperimental-assignment-tracking=disabled -mllvm
> -inline-threshold=2500 -fno-slp-vectorize -Werror=unused-result -MD -MT
> test/boost/CMakeFiles/auth_test.dir/RelWithDebInfo/auth_test.cc.o -MF
> test/boost/CMakeFiles/auth_test.dir/RelWithDebInfo/auth_test.cc.o.d -o
> test/boost/CMakeFiles/auth_test.dir/RelWithDebInfo/auth_test.cc.o -c
> $srcdir/test/boost/auth_test.cc
> $srcdir/test/boost/auth_test.cc:22:10: fatal error: 'cql3/CqlParser.hpp'
> file not found
>    22 | #include "cql3/CqlParser.hpp"
>       |          ^~~~~~~~~~~~~~~~~~~~
> 1 error generated.

State the dependency now.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-09-27 11:38:03 +02:00
Tomasz Grabiec
b5ae7da9d2 sstables: bsearch_clustered_cursor: Add trace-level logging 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
8e54ecd38e sstables: bsearch_clustered_cursor: Move definitions out of line
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.
2024-09-27 01:25:15 +02:00
Tomasz Grabiec
0279ac5faa test, sstables: Verify parsing stability when allocating section is retried 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
c09fa0cb98 test, sstables: Verify parsing stability when buffers cross page boundary 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
7670ee701a sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
This fixes a use-after-free bug when parsing clustering key across
pages.

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.

Fixes #20766
2024-09-27 01:25:15 +02:00
Tomasz Grabiec
c15145b71d cached_file: Adapt page_view to ContiguousSharedBuffer 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
29498a97ae cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
Will be easier to implement ContiguousSharedBuffer API as the buffer
size will be equal to _size.
2024-09-27 01:25:15 +02:00
Tomasz Grabiec
c0fa49bab5 sstables, utils: Allow parsers to work with different buffer types
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.

In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.

It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
2024-09-27 01:24:54 +02:00
Tomasz Grabiec
93bfaf4282 sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
When reset() is done due to allocating section retry, it can be
theoretically in an arbitrary point. So we should not assume that it
finished parsing and state was reset by previous parsing. We should
reset all the fields.
2024-09-27 01:23:43 +02:00
Tomasz Grabiec
ac823b1050 sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
To unify logic which handles allocating section retry, and thus
improve safety.
2024-09-27 01:22:35 +02:00
Nadav Har'El
9af43dcd06 Merge 'Move collections stress tests from unit/ to boost/' from Pavel Emelyanov
Collection stress tests include testing of B- B+- and radix trees, and those tests live in unit/ suite. There are also small corner-case tests for those collections in boost/ suite. There's an attempt to get rid of unit suite in favor of boost one, and this PR moves the collections stress testing from unit suite into their boost counterparts.

refs: scylladb/qa-tasks#1655

Closes scylladb/scylladb#20475

* github.com:scylladb/scylladb:
  test: Move other collection-testing headers from unit to boost
  test: Move stress-collecton header from unit to boost
  test: Move B+tree compactiont test from unit to boost
  test: Move radix tree compactiont test from unit to boost
  test: Move B-tree compactiont test from unit to boost
  test: Move radix tree stress test from unit to boost
  test: Move B-tree stress test from unit to boost
  test: Move b+tree stress test from unit to boost
  test: Add bool in_thread argument to stress_collection function
2024-09-26 18:11:23 +03:00
Botond Dénes
9fe64b5d70 Merge 'Remove datadir string from table::config' from Pavel Emelyanov
The datadir keeps path to directory where local sstables can be. The very same information is now kept in table's storage options (#20542). This set fixes the remaining places that still use table::config::datadir and table::dir() and removes the datadir field.

Closes scylladb/scylladb#20675

* github.com:scylladb/scylladb:
  treewide: Remove table::config::datadir
  distributed_loader: Print storage options, not datadir
  data_dictionary: Add formatter for storage_options
  test: Construct table_for_tests with table storage options
  test: Generalize pair of make_table_for_tests helpers
  tests: Add helper to get snapshot directory from storage options
  table: snapshot_exists: Get directory from storage options
  table: snapshot_on_all_shards: Get directory from storage options
2024-09-26 15:26:45 +03:00
Kamil Braun
9224e48d6b Merge 'Populate raft address map from gossiper on raft configuration change' from Gleb Natapov
For each new node added to the raft config populate its ID to IP mapping
in raft address map from the gossiper. The mapping may have expired if a
node is added to the raft configuration long after it first appears in
the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

Closes scylladb/scylladb#20601

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-26 12:41:25 +02:00
Tomasz Grabiec
8aca93b3ec sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
Parser's state was not reset when allocating section was retried.

This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.
2024-09-26 12:34:41 +02:00
Laszlo Ersek
ed91d35171 sstables: coroutinize sstable::load()
Best viewed with "git show -b -W".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20822
2024-09-26 13:26:22 +03:00
Lakshmi Narayanan Sreethar
7beea03196 build: cmake: link cql3 library to the service library
After commit d16ea0af, compiling the server using cmake fails with the
following error :

```
FAILED: service/CMakeFiles/service.dir/Dev/qos/service_level_controller.cc.o
...
/home/Scylla/scylladb/cql3/util.hh:21:10: fatal error: 'cql3/CqlParser.hpp' file not found
   21 | #include "cql3/CqlParser.hpp"
      |          ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```

Fix it by linking the cql3 to the service library.

Closes scylladb/scylladb#20805
2024-09-26 09:17:30 +03:00
Yaron Kaikov
fac682df7e [script/pull_github_pr.sh] Check Gating status before merging
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers

Related to scylladb/scylla-pkg#3644

Closes scylladb/scylladb#20742
2024-09-26 08:44:06 +03:00
Nadav Har'El
7715abfc56 Merge 'Alternator store ProvisionedThroughput' from Amnon Heiman
When users create a table using the Alternator API, they can decide if the billing is PROVISIONED of PAY_PER_REQUEST.
If the billing is set to PROVISIONED, they need to set the ProvisionedThroughput ReadCapacityUnits (RCU) and WriteCapacityUnits (WCU).

This series adds support for getting and setting the ProvisionedThroughput. The values will be stored as table extension tags.
Following how TTL is stored within the Alternator, we will use ```system:rcu_attribute``` and ```system:wcu_attribute``` for the labels.

The series adds a test that sets ProvisionedThroughput and validates that it gets the value back. It was tested with both Alternator and AWS.

This series is part of the effort to monitor, limit, and bill Alternator operations.

New code, no need to backport.

Closes scylladb/scylladb#20056

* github.com:scylladb/scylladb:
  docs/alternator/compatibility.md: explain the consumed capacity provisioned
  Add test/alternator/test_provisioned_throughput.py
  test/alternator/util.py: Allow override BillingMode
  alternator/executor.cc: Store ProvisionedThroughput
2024-09-26 01:23:17 +03:00
Avi Kivity
357168114b cql3: statement_restrictions: use the evaluator to calculate token for constrained global index query
A global index has a primary key of the form

   (indexed_column, token, partition_key_column..., clustering_key_column...)

The primary key columns are used to point at the base table row, and
the token (computed as token(partition_key_column...) is used to maintain
sort order.

The query planner has an optimization: if the partition key is fully
constrained to a unique value, then we compute the token from the partition
key and use that to seek directly into the clustering row range for
that base table partition. If the clustering key is also partially
constrained, it is used to refine the index clustering key.

Currently, this optimization is implemented as a hack: the partition key
is extracted from the prepared statement + query options in
get_global_index_token_clustering_ranges(), then used to calculate
the token, which is then substituted in the expression passed to
get_single_column_clustering_bounds() (the expression is shared across
all running queries, so this is quite dangerous).

We simplify the whole thing:

 - Let prepare_index_global() recognize that if the partition key is not
   fully constrained, then there is no way that we'll be able to compute
   the token (as it needs all partition key columns). Since the token
   is the first clustering key column of the index table, we can truncate
   it to length zero and bail out.

 - Otherwise, the partition key is fully constrained. We refactor the
   predicate (pk1 = :a AND pk2 = :b) to (pk1, pk2) := (:a, :b). We then
   pass expressions representing the partition key to the token function,
   ending up with token(:a, :b). We then substitute this expression into
   (*_idx_tbl_ck_prefix)[0], which computes the first clustering key
   column for the index table.

 - Remove the runtime component in get_global_index_clustering_ranges().
   Note this include the early return if the partition key wasn't fully
   constrained (though the comment only mentions over-constraining), and
   the token computation, which is now done by evaluate().

Closes scylladb/scylladb#20733
2024-09-25 22:48:16 +03:00
Yaron Kaikov
d164fd45bc install-dependencies.sh: update node_exporter to 1.8.2
Update node_exporter to 1.8.2

Fixes: #18493

Closes scylladb/scylladb#20254

[avi: regenerate frozen toolchain, with new clang in

  https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-x86_64.tar.gz

new clang regenerated due to new packaging format (f6fe4d9e73) and
some other minor changes.]
2024-09-25 18:42:25 +03:00
Gleb Natapov
9e4cd32096 test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join 2024-09-25 17:10:09 +03:00
Kamil Braun
7d8f1d251a Merge 'Mark node as being replaced earlier' from Gleb Natapov
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: scylladb/scylladb#20629

Need to be backported since this is a regression

Closes scylladb/scylladb#20743

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-25 15:46:12 +02:00
Kamil Braun
09c68c0731 service: raft: fix rpc error message
What it called "leader" is actually the destination of the RPC.

Trivial fix, should be backported to all affected versions.

Closes scylladb/scylladb#20789
2024-09-25 15:46:37 +03:00
Kefu Chai
d5b348460f config: do not provide default value for set_value() and friends
before this change, `config_file::set_value()` and
`config_file::set_value_on_all_shards()` provide default value for
`config_source`. but the default value is never used -- we alway
specify the `source_source` when calling `set_value_on_all_shards()`.

so in hope to improve the readability, the default value is removed.
so, for example, one can figure out when `config_source::Internal` is
used with less efforts. despite that `config_file::set_value()` is not
used in the tree. for the sake of completeness, its default value is
also dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20728
2024-09-25 15:45:42 +03:00
Anna Stuchlik
8145109120 doc: add OS support for version 6.2
This commit adds the OS support for version 6.2.
In addition, it removes support for 6.0, as the policy is only to include
information for the supported versions, i.e., the two latest versions.

Fixes https://github.com/scylladb/scylladb/issues/20804

Closes scylladb/scylladb#20806
2024-09-25 15:39:23 +03:00
Pavel Emelyanov
ae76481444 Merge 'treewide: add "table" parameter to "backup" API ' from Kefu Chai
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.

Fixes https://github.com/scylladb/scylladb/issues/20636

---

this change is a part of the efforts to bring the native backup/restore to scylla, no need to backprt.

Closes scylladb/scylladb#20661

* github.com:scylladb/scylladb:
  backup_task: fix the indent
  treewide: add "table" parameter to "backup" API
2024-09-25 10:53:38 +03:00
Takuya ASADA
f6fe4d9e73 toolchain: fix broken INSTALL_FROM mode
We found that --clang-build-mode INSTALL_FROM tries to rebuild clang
even we use an archive of prebuilt image.
Seems like it is because ninja detected changes on standard library
headers, which updated when we build new frozen toolchain container
image.

To avoid such unnecessary rebuild, we should stop archive whole clang
build directory, we should archive install image instead.
To do so, we can use
"DESTDIR=<sysroot dir> ninja install-distribution-stripped", and archive
sysroot dir as clang archive.

Fixes #20421

Closes scylladb/scylladb#20422
2024-09-25 10:48:56 +03:00
Anna Stuchlik
da8047a834 doc: add an intro to the Features page
This commit modifies the Features page in the following way:

- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
  e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
  Enterprise info in the OSS docs)

Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711

Closes scylladb/scylladb#20635
2024-09-25 08:50:21 +03:00
Aleksandra Martyniuk
3195ebd04e node_ops: make node_ops tasks type more human-friendly
Currently, node ops tasks type is retrieved from topology_request
without any change. Use respective node operation name instead.

Closes scylladb/scylladb#20671
2024-09-25 08:49:34 +03:00
Kamil Braun
69b4769418 test: fix topology_custom/test_raft_recovery_stuck flakiness
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.

Fixes scylladb/scylladb#20791

Should be backported to all branches containing this test to reduce
flakiness.

Closes scylladb/scylladb#20792
2024-09-25 08:45:37 +03:00
Kefu Chai
54858b8242 backup_task: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-25 09:11:26 +08:00
Kefu Chai
d663b6c13b treewide: add "table" parameter to "backup" API
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.

in this change:

* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
  `snapshot_ctl::start_backup`. since we have all the information
  for composing the snapshot directory, and what the `backup_task_impl`
  class is interested is but the snapshot directory, we just pass
  the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
  only scan the specified snapshot directory.

Fixes scylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-25 09:11:26 +08:00
Avi Kivity
d16ea0afd6 Merge 'cql3: Extend DESC SCHEMA by auth and service levels' from Dawid Mędrek
Auth has been managed via Raft since Scylla 6.0. Restoring data
following the usual procedure (1) is error-prone and so a safer
method must have been designed and implemented. That's what
happens in this PR.

We want to extend `DESC SCHEMA` by auth and service levels
to provide a safe way to backup and restore those two components.
To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS`
and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`.

* `DESC SCHEMA` -- no change, i.e. the statement describes the current
  schema items such as keyspaces, tables, views, UDTs, etc.
* `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier
  and also describes auth and service levels. No information about
  passwords is returned.
* `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same
  as the previous tier and also includes information about the salted
  hashes corresponding to the passwords of roles.

To restore existing roles, we extend the `CREATE ROLE` statement
by allowing to use the option `WITH SALTED HASH = '[...]'`.

---

Implementation strategy:

* Add missing things/adjust existing ones that will be used later.
* Implement creating a role with salted hash.
* Add tests for creating a role with salted hash.
* Prepare for implementing describe functionality of auth and service levels.
* Implement describe functionality for elements of auth and service levels.
* Extend the grammar.
* Add tests for describe auth and service levels.
* Add/update documentation.

---

(1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html
In case the link stops working, restoring a schema was realised
by managing raw files on disk.

Fixes scylladb/scylladb#18750
Fixes scylladb/scylladb#18751
Fixes scylladb/scylladb#20711

Closes scylladb/scylladb#20168

* github.com:scylladb/scylladb:
  docs: Update user documentation for backup and restore
  docs/dev: Add documentation for DESC SCHEMA
  test: Add tests for describing auth and service levels
  cql3/functions/user_function: Remove newline character before and after UDF body
  cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
  auth: Implement describing auth
  auth/authenticator: Add member functions for querying password hash
  service/qos/service_level_controller: Describe service levels
  data_dictionary: Remove keyspace_element.hh
  treewide: Start using new overloads of describe
  treewide: Fix indentation in describe functions
  treewide: Return create statement optionally in describe functions
  treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
  treewide: Start using schema::ks_name() instead of schema::keyspace_name()
  cql3: Refactor `description`
  cql3: Move description to dedicated files
  test: Add tests for `CREATE ROLE WITH SALTED HASH`
  cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
  auth: Allow for creating roles with SALTED HASH
  types: Introduce a function `cql3_type_name_without_frozen()`
  cql3/util: Accept std::string_view rather than const sstring&
2024-09-24 21:44:32 +03:00
Tomasz Grabiec
bca8258150 Merge 'tablet: Fix single-sstable split when attaching new unsplit sstables' from Raphael "Raph" Carvalho
To fix a race between split and repair here c1de4859d8, a new sstable
  generated during streaming can be split before being attached to the sstable
  set. That's to prevent an unsplit sstable from reaching the set after the
  tablet map is resized.

  So we can think this split is an extension of the sstable writer. A failure
  during split means the new sstable won't be added. Also, the duration of split
  is also adding to the time erm is held. For example, repair writer will only
  release its erm once the split sstable is added into the set.

  This single-sstable split is going through run_custom_job(), which serializes
  with other maintenance tasks. That was a terrible decision, since the split may
  have to wait for ongoing maintenance task to finish, which means holding erm
  for longer. Additionally, if split monitor decides to run split on the entire
  compaction group, it can cause single-sstable split to be aborted since the
  former wants to select all sstables, propagating a failure to the streaming
  writer.
  That results in new sstable being leaked and may cause problems on restart,
  since the underlying tablet may have moved elsewhere or multiple splits may
  have happened. We have some fragility today in cleaning up leaked sstables on
  streaming failure, but this single-sstable split made it worse since the
  failure can happen during normal operation, when there's e.g. no I/O error.

  It makes sense to kill run_custom_job() usage, since the single-sstable split
  is offline and an extension of sstable writing, therefore it makes no sense to
  serialize with maintenance tasks. It must also inherit the sched group of the
  process writing the new sstable. The inheritance happens today, but is fragile.

  Fixes #20626.

Closes scylladb/scylladb#20737

* github.com:scylladb/scylladb:
  tablet: Fix single-sstable split when attaching new unsplit sstables
  replica: Fix tablet split execute after restart
2024-09-24 19:46:11 +02:00
Abhinav
36d68ec955 raft topology: add error for removal of non-normal nodes
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271

Closes scylladb/scylladb#20500
2024-09-24 16:11:19 +02:00
Botond Dénes
24ac408a08 Revert "[script/pull_github_pr.sh] Check Gating status before merging"
This reverts commit ec0bb42b45.

This patch broke maintainer workflows, it needs more work before it can
land.
2024-09-24 16:53:02 +03:00
Artsiom Mishuta
c07306582b test.py: deselect remove_data_dir_of_dead_node event
Deselect remove_data_dir_of_dead_node event from test_random_failures
due to issue scylladb/scylladb#20751

Closes scylladb/scylladb#20790
2024-09-24 14:49:00 +02:00
Dawid Mędrek
1ef51be1d7 docs: Update user documentation for backup and restore
We update the relevant articles addressing backing-up
and restoring the schema by specifying that the user
performing it must be a superuser. We also update
the required version of cqlsh.

Additionally, we add an article covering the fundamental
information on `DESCRIBE SCHEMA`.
2024-09-24 14:21:15 +02:00
Dawid Mędrek
5e1d7f109a docs/dev: Add documentation for DESC SCHEMA
We add documentation for developers addressing
`DESCRIBE SCHEMA`. It covers the following aspects
of it:

* motivation,
* synopsis of the solution,
* implementation of the solution,

as well as a few subsections explaining the details:

* restoring process and its side effects,
* restoring roles with passwords,
* list of statements generated by `DESC SCHEMA`
  with examples,
* implementation details.
2024-09-24 14:18:01 +02:00
Dawid Mędrek
d42f1604ad test: Add tests for describing auth and service levels
We add tests verifying the following features work correctly:

* describing auth: roles, role grants, granting permissions on
  resources,
* describing service levels: creating them and attaching to roles.
2024-09-24 14:18:01 +02:00
Dawid Mędrek
10d13f541b cql3/functions/user_function: Remove newline character before and after UDF body
We remove newline characters that are printed before and after
a UDF's body. This way, we want to keep the create statement
as close to what was actually provided as possible. Although
there should be no semantic differences with or without the
newline characters, it's a lot more convenient in testing when
they're not present.

Fixes scylladb/scylladb#20711
2024-09-24 14:18:01 +02:00
Dawid Mędrek
be851cef10 cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
When executing `DESC SCHEMA WITH INTERNALS`, Scylla now also returns
statements that can be used to recreate service levels and restore
the state of auth. That encompasses granting roles and permissions
as well as attaching service levels to roles.

If the additional parameter `WITH PASSWORDS` is provided,
the statements corresponding to recreating roles in the system
will also contain the stored salted hashes.
2024-09-24 14:18:01 +02:00
Dawid Mędrek
2a27d4b4d6 auth: Implement describing auth
We introduce a function `describe_auth()` in `auth::service`
responsible for producing a sequence of descriptions whose
corresponding CQL statement can be used to restore the state
of auth.
2024-09-24 14:17:58 +02:00
Nadav Har'El
b70ab7bd64 test/boost: add README.md
Add a README.md in test/boost, giving a short introduction to what this
directory is and what kind of tests it contains, and how to run individual
tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20550
2024-09-24 15:16:55 +03:00
Pavel Emelyanov
39dc340424 test: Move other collection-testing headers from unit to boost
Simple and straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
f0d60c2b4d test: Move stress-collecton header from unit to boost
Now all its users are in boost suite. Once moved, the
stress_collection() function no longer runs in seastar thread, and the
in_thread argument is removed while the function is moved.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
4cf4b7d4ef test: Move B+tree compactiont test from unit to boost
This time the boost test needs to stop being pure-boost test, since
bptree compaction test case needs to run in seastar thread. Other
collection tests are already such, not bptree_test joins the party.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
d1f727669c test: Move radix tree compactiont test from unit to boost
No surprises here, just move the code and hard-code default args.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
bdcf965318 test: Move B-tree compactiont test from unit to boost
This test must run in seastar thread, so put it in seastar-thread test
case, fortunately btree test allows that. Just like its stress peer,
this test also has two invocations from suite, so make it two distinct
test cases as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
328b5b71d7 test: Move radix tree stress test from unit to boost
Just move the code. Test "scale" is also taken from default unit test
arguments.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:13 +03:00
Pavel Emelyanov
023cc99514 test: Move B-tree stress test from unit to boost
This also moves the code, but takes into account the stress test had two
invovations with suite options -- small and large. Inherit both with two
distinct test cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:42:12 +03:00
Pavel Emelyanov
72cb835c1e test: Move b+tree stress test from unit to boost
Just move the code. And hard-code the "scale" (i.e. -- number of keys
and iterations) from default arguments of the unit test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:31:33 +03:00
Pavel Emelyanov
f0526bf6a4 test: Add bool in_thread argument to stress_collection function
This code is going to be shared between seastar thread and boost tests,
temporarily. So not to yield in pure boost test, add the switch. It will
be removed really soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-24 13:31:33 +03:00
Tomasz Grabiec
bd6eeb4730 Merge 'Separate schema merging logic' from Marcin Maliszkiewicz
This patch doesn't yet change how schema merging works but it prepares the ground for it by simplifying the code and separating merging logic into its own unit.

It consists of:
- minor cleanups of unused code
- moving code into separate file
- simplifying merge_keyspaces code

More detailed explanation in per commit messages.

Relates scylladb/scylladb#19153

Closes scylladb/scylladb#19687

* github.com:scylladb/scylladb:
  db: schema_applier: simplify merge_keyspaces function
  db: schema_applier: remove unnecessary read in merge_keyspaces
  db: schema_tables: move scylla specific code into create keyspace function
  db: move schema merging code into a separate unit
  db: schema_tables: export some schema management functions
  replica: remove unused table_selector forward declaration
  db: remove unused flush arg from do_merge_schema func
  db: remove unused read_arg_values function
2024-09-24 11:43:06 +02:00
Michał Jadwiszczak
d7945eea2a docs/dev/service_levels: replace unspecified workload type with NULL
`unspecified` workload type is an internal value and it's not exposed to
user via CQL.
Default value for workload type from user's perspective is `NULL`.

Fixes scylladb/scylladb#20780
2024-09-24 11:43:29 +03:00
Yaron Kaikov
ec0bb42b45 [script/pull_github_pr.sh] Check Gating status before merging
Maintainers use scripts/pull_github_pr.sh from scylladb.git when merging PRs and before pushing to the next. We want to prevent merges from piling up on top of unstable builds. This change will check Gating's current status and notify the maintainers

Related to https://github.com/scylladb/scylla-pkg/issues/3644

Closes scylladb/scylladb#20742
2024-09-24 08:39:47 +03:00
Pavel Emelyanov
9fd8eba3ec proxy: Don't keep truncate timeout as optional argument
Because it is never such -- the only caller of truncate_blocking()
always knows the timeout it want this method to use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20620
2024-09-24 08:25:54 +03:00
Pavel Emelyanov
d64529f370 Merge 'sstables/sstables.hh: Remove unused forward declarations' from Nikos Dragazis
Code cleanup, no backport needed.

Closes scylladb/scylladb#20767

* github.com:scylladb/scylladb:
  sstables: Remove forward declaration for random_access_reader
  sstables: Remove forward declaration for metadata_collector
  sstables: Remove forward declaration for sstables_manager
  sstables: Remove forward declaration for sstable_writer_v2
  sstables: Remove forward declaration for key
2024-09-24 07:44:56 +03:00
Andrei Chekun
da2397005b test.py: Remount cgroup before changing files ownership
Change order of functions: firstly remount, then change ownership for
cgroup. It was not failing before because with privileged mode, it will
mount cgroups as RW, but it's better to have this check if behavior will
change.

Closes scylladb/scylladb#20676
2024-09-24 07:27:24 +03:00
Kefu Chai
657ea95f4c main: coroutinize read_config()
for better readability.

read_config() is not on the critical path, so the performance
degradation caused by C++20 couroutine is neglectable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20694
2024-09-24 06:30:34 +03:00
Gleb Natapov
1213f02a5a test: skip test_lwt_semaphore::test_cas_semaphore in aarch64 debug mode
The test configures write timeout to much smaller value to make the test
run faster since for some writes sleep is inserted to hit the timeout,
but it makes aarch64 debug flaky since timeout happens when it should
not because of a natural slowness.

Fixes scylladb/scylladb#20515

Closes scylladb/scylladb#20744
2024-09-23 20:46:55 +02:00
Avi Kivity
5c329e3db0 Merge 'Put sstables::test class on a diet' from Pavel Emelyanov
This one is aimed at giving tests the ability to call private methods of class sstable. Some of the wrappers in the test class wrap public methods and can be removed.

Closes scylladb/scylladb#20614

* github.com:scylladb/scylladb:
  test: Remove sstables::test::binary_search()
  test: Remove sstables::test::move_summary()
  test: Remove sstables::test::read_toc()
  test: Remove sstables::test::get_summary()
  test: Remove sstables::test::get_statistics()
  test: Remove sstables::test::data_read()
2024-09-23 21:40:40 +03:00
Yaniv Michael Kaul
26f2cbdfe2 optimized_clang.sh: compile with -march
Add for both x86_64 compilation flags for clang, to get it compile with newer arch x86_64-v3 for x86 and ARM 8.2 level for aarch64.

Tested to compile fine with both clang 18.1.6 and 18.1.8.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#20682
2024-09-23 17:40:20 +03:00
Paweł Zakrzewski
16dd58fb0d cql3: respect the user-defined page size in aggregate queries
This change allows the user to fully set the page size for the query.
There's still an internal hard-limit of 1MB anyway, so there's no need
to limit it to our default value (because using a larger page size might
be a query optimization sometimes)

Fixes #20612

Closes scylladb/scylladb#20692
2024-09-23 16:31:21 +03:00
Botond Dénes
64ed3f80c7 Merge 'Coroutinize sstable_directory::remove_unshared_sstables()' from Pavel Emelyanov
This one is pretty simple

```
    return do_with(std::move(data), [] {
        toss_data(data);
        return remove(std::move(data));
    });
```

it doesn't really need to do_with() since "toss_data" is non-preemptive. Still, convert it into

```
    toss_data(data);
    co_await remove(std::move(data));
```

Closes scylladb/scylladb#20479

* github.com:scylladb/scylladb:
  sstables: Restore indentation after previous patch
  sstables: Coroutinize remove_unshared_sstables()
2024-09-23 16:15:46 +03:00
Kefu Chai
1aa030a8cd docs: explain precedence of configure options
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20696
2024-09-23 16:12:44 +03:00
Yaniv Michael Kaul
85c0bb7ff4 optimized_clang.sh: add missing symbolic links to clang (for ccache)
The removal of clang removes the symblic links ccache uses to mask itself as clang/clang++
Manually add them back, so ccache can work.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fixes: https://github.com/scylladb/scylladb/issues/20490

Closes scylladb/scylladb#20491
2024-09-23 15:55:22 +03:00
Nikos Dragazis
1e4b67dd8a sstables: Remove forward declaration for random_access_reader
The sstables header contains a forward declaration for
`random_access_reader`. This was introduced in 75dc7b799e for no
obvious reason.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-23 15:28:43 +03:00
Nikos Dragazis
83ccd5bcca sstables: Remove forward declaration for metadata_collector
The sstables header contains a forward declaration for
`metadata_collector`. This was introduced in 2d6608bb88 for the return
value of the `sstable_writer::get_metadata_collector()`. This function
was later removed in 9e7144f719 but the forward declaration was left
behind.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-23 15:28:43 +03:00
Nikos Dragazis
90aff33cb0 sstables: Remove forward declaration for sstables_manager
This is a duplicate.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-23 15:28:43 +03:00
Nikos Dragazis
4efca437c8 sstables: Remove forward declaration for sstable_writer_v2
The sstables header contains a forward declaration for
`sstable_writer_v2`. This was introduced in fed5b73147 but never used.
It is probably a leftover from a previous revision of the patchset.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-23 15:28:43 +03:00
Nikos Dragazis
fda98ba9f6 sstables: Remove forward declaration for key
The sstables header contains a forward declaration for `key`.
This was introduced in 198f55dc5c for a reference parameter in
`binary_search()`.

The function was eventually moved to a different header in 4ed7e529db
but the forward declaration was left behind.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-23 15:28:26 +03:00
Piotr Dulikowski
d1c7e2effa configure.py: deduplicate --out-final-name arg added in build.ninja
Every time the ninja buildfile decides it needs to be updates, it calls
the configure.py script with roughly the same set of flags. However, the
--out-final-name flag is improperly handled and, on each reconfigure,
one more --out-final-name flag is appended to the rebuild command. This
is harmless because each instance of the flag will specify the same
parameter, but slightly annoying because it bloats the generated file
and the duplicated flags show up in ninja's output when reconfigure
runs.

Fix the problem by stripping the --out-final-name flags from the set of
the flags passed to the configure.py before forwarding them to the
reconfigure rule.

Closes scylladb/scylladb#20731
2024-09-23 15:05:13 +03:00
Dawid Mędrek
90ce86930a auth/authenticator: Add member functions for querying password hash
We add new member functions to the interface of `auth::authenticator`
responsible for querying the password hash corresponding to a given
role. One method indicates whether a given authenticator uses
password hashes, while the other queries them or throws an exception
password hashes are not used.

The rationale for extending the interface of authenticator is
to be able to access salted hashes from other parts of auth.
We will need them in an upcoming commit responsible for describing
auth.
2024-09-23 13:55:52 +02:00
Dawid Mędrek
6517ca8920 service/qos/service_level_controller: Describe service levels
We implement a member function responsible for producing
instances of `cql3::description` that can be used to
restore service levels.
2024-09-23 13:55:49 +02:00
Kefu Chai
40f2d4c988 build: cmake: drop scylla-jmx from the build
in 3cd2a61736, we dropped scylla-jmx
from the build. but didn't update the CMake building system accordingly,
this broke the CMake build, as the dependencies pointing to jmx cannot
be found or fulfilled.

in this change, we remove all references to jmx in the CMake build.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20736
2024-09-23 14:20:42 +03:00
Marcin Maliszkiewicz
2df8eefd67 db: schema_applier: simplify merge_keyspaces function
- removes uneccesary temporary sets/vectors
- removes auto&&
- moves return value instead of copying
- instead adds diff references to keep readability
- create and alter logic is almost the same, now it's visible better
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
7225538845 db: schema_applier: remove unnecessary read in merge_keyspaces
read_schema_partition_for_keyspace() is already called for every
changing keyspace by get_schema_complete_view() and stored in _after
field so we can reuse this data.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
f49822f78d db: schema_tables: move scylla specific code into create keyspace function
Since extract_scylla_specific_keyspace_info() was always coupled
with create_keyspace_from_schema_partition() there is no value
in separating them. By moving first into the latter we:
- reduce number of exported functions
- simplify arguments of create_keyspace_from_schema_partition
- simplify caller's code
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
9792d720c9 db: move schema merging code into a separate unit
It's mostly self containted and it's easier to
maintain reasonably sized files. Also splitting
better shows boundaries between schema and
schema merging code.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
208050f190 db: schema_tables: export some schema management functions
In subseqent commits schema merging code will be separated from
db/schema_tables.cc but code which manages schema will remain intact.
So those two translation units will share some amount of code.

It's similar case as with replica/database.cc which creates schema
on startup, it calls functions from db/schema_tables.cc.

Struct qualified_name got moved to header as it's used as
read_table_mutations() argument.
2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
258ffbd126 replica: remove unused table_selector forward declaration 2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
4630864b58 db: remove unused flush arg from do_merge_schema func 2024-09-23 12:01:36 +02:00
Marcin Maliszkiewicz
4cce9c8b5a db: remove unused read_arg_values function 2024-09-23 12:01:36 +02:00
Nadav Har'El
6496eab5ee Merge 'Rename Alternator batch item count metrics' from Amnon Heiman
This PR addresses multiple issues with alternator batch metrics:

1. Rename the metrics to scylla_alternator_batch_item_count with op=BatchGetItem/BatchWriteItem
2. The batch size calculation was wrong and didn't count all items in the batch.
3. Add a test to validate that the metrics values increase by the correct value (not just increase). This also requires an addition to the testing to validate ops of different metrics and an exact value change.

Needs backporting to allow the monitoring to use the correct metrics names.

Fixes #20571

Closes scylladb/scylladb#20646

* github.com:scylladb/scylladb:
  alternator:test_metrics test metrics for batch item count
  alternator:test_metrics Add validating the increased value
  alternator: Fix item counting in batch operations
  Alterntor rename batch item count metrics
2024-09-23 10:13:07 +03:00
Kefu Chai
2014d1c0cb cql3: drop workaround for castas_fctn_simple()
now that e13a584ab7
has been merged, and our toolchain is based on the fedora 40 on 20240710, which should include this
change.

so let's drop the workaround from 51d09e6a

Refs #18508
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20750
2024-09-22 19:59:10 +03:00
Kefu Chai
fdc8773278 test/scylla_gdb: get table::_schema raw pointer with lw_shared_ptr
This commit addresses an issue where accessing the raw pointer of the
schema instance within `table::_schema` using `table.schema._p` was
unreliable.

before this change, `_p` was of type `lw_shared_ptr_counter_base*`, a
type-erased smart pointer, preventing direct casting to the underlying
schema pointer. but we still cast it to `schema*` anyway. this led to
a gdb.MemoryError when dereferencing the deduced pointer:
but the type of `_p` is `lw_shared_ptr_counter_base*`, which is a
type erased smart pointer, and it cannot be casted directly to the
under pointer pointing to a `schema` instance. this results in:

```
Traceback (most recent call last):
  File "/home/avi/scylla/test/scylla_gdb/../../scylla-gdb.py", line 5554, in invoke
    self.print_key_type(seastar_lw_shared_ptr(schema['_clustering_key_type']).get().dereference(), 'clustering')
  File "/home/avi/scylla/test/scylla_gdb/../../scylla-gdb.py", line 5533, in print_key_type
    key_type = seastar_shared_ptr(key_type).get().dereference()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.MemoryError: Cannot access memory at address 0x4000079656b0078
```

when we are dereferencing the raw pointer deduced this way.

in this change,

* we use the wrapper of `seastar_lw_shared_ptr` to safely obtain the
  raw pointer.
* reenable this test previously disabled by 3d781c4f

tested using
```console
$ SCYLLA=/home/kefu/dev/scylladb/master/build/release/scylla \
  test/scylla_gdb/run -o junit_suite_name=scylla_gdb test_misc.py::test_schema
```

on an up-to-date fedora 40 installation.

Refs 3d781c4f
Fixes scylladb/scylladb#20741
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20746
2024-09-22 18:30:16 +03:00
Avi Kivity
657848dcbb cql3: statement_restrictions, expr: move restrictions-related expression utilities out of expression.cc
Move all of the blatantly restriction-related expression utilities
to statement_restrictions.cc.

Some are so blatant as to include the word "restriction" in their name.
Others are just so specialized that they cannot be used for anything else.

The motivation is that further refactoring will be simplified if it can
happen within the same module, as there will not be a need to prove
it has no effect elsewhere.

Most of the declarations are made non-public (in .cc file) to limit
proliferation. A few are needed for tests or in select_statement.cc
and so are kept public.

Other than that, the only changes are namespace qualifications and
removal of a now-duplicate definition ("inclusive").

Closes scylladb/scylladb#20732
2024-09-22 11:00:51 +03:00
Avi Kivity
3d781c4fc8 Update frozen toolchain
* tools/java e505a6d3bb...5b0e274f12 (1):
  > Merge 'build.xml: install and use java-11 when building' from Kefu Chai

Updates to clang 18.1.8 + LLVM patch to match Fedora 40.

New optimized clang build generated and stored in

    https://devpkg.scylladb.com/clang/clang-18.1.8-x86_64.tar.gz
    https://devpkg.scylladb.com/clang/clang-18.1.8-aarch64.tar.gz

Due to the loss of the jmx submodule, we no longer install java-11-openjdk.
We add it in install-dependencies.sh here to compensate, pending a better
solution.

tools/java submodule updated to remove build failure where Java 8
was selected instead of Java 11.

The scylla_gdb test suite was disabled due to a regression in gdb 15,
which is brought in by the toolchain update [1].

[1] https://github.com/scylladb/scylladb/issues/20741.
2024-09-21 20:07:28 +03:00
Raphael S. Carvalho
38ce2c605d tablet: Fix single-sstable split when attaching new unsplit sstables
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.

So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.

This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.

It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.

Fixes #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-09-20 23:03:01 -03:00
Raphael S. Carvalho
999f1f1318 replica: Fix tablet split execute after restart
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.

1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty

[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.

This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.

The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.

Refs #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-09-20 22:28:09 -03:00
Avi Kivity
cd861bc788 row_cache: coroutinize do_update()
do_with() makes the change a no-brainer, and besides, it's called once
per huge update.

Closes scylladb/scylladb#20735
2024-09-21 00:07:02 +02:00
Botond Dénes
488a372fdc tool/scylla-nodetool: status: reorder endpoint calls to match old nodetool
Old nodetool requested `/storage_service/tokens_endpoing` first, then
`/storage_service/host_id`, while the native nodetool did it in reverse
order. Most of the time this is inconsequential but there is an edge
case when a node's IP address is changed. This reversing of the order
results in unexpected behavior for tests, causing noise via flaky
tests.
Match the order of the old nodetool so that the native nodetool exhibits
the behavior expected by tests (and users too probably).

Fixes: scylladb/scylladb#18693

Closes scylladb/scylladb#20615
2024-09-20 15:07:16 +02:00
Dawid Mędrek
b357307406 data_dictionary: Remove keyspace_element.hh
The interface is not used anywhere anymore, so we can
remove it safely. It has been replaced by custom
functions for each keyspace element and `cql3::description`.
2024-09-20 14:24:54 +02:00
Dawid Mędrek
7b4f9c806c treewide: Start using new overloads of describe
We continue removing `data_dictionary::keyspace_element`.
In this commit, we start using the overloads returning
`cql3::description` in places where the methods specified
by `data_dictionary::keyspace_element` were used.
2024-09-20 14:24:54 +02:00
Dawid Mędrek
df94e92b06 treewide: Fix indentation in describe functions
After modifying new functions for generating `cql3::description`,
we fix indentation in them in this commit.
2024-09-20 14:24:54 +02:00
Dawid Mędrek
86722e4cea treewide: Return create statement optionally in describe functions
We add a new parameter in functions used to generate instances
of `cql3::description` for types related to situations where we
might not need a create statement. An example of such a scenario
could be `DESCRIBE TYPES`.
2024-09-20 14:24:54 +02:00
Dawid Mędrek
0702e93e32 treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
We're removing `data_dictionary::keyspace_element`.
Before we can do that, we need to substitute the existing
methods used for describing keyspace elements with their
new versions returning `cql3::description`.
That's what happens in this commit.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
39cf106151 treewide: Start using schema::ks_name() instead of schema::keyspace_name()
We're going to remove the interface `data_dictionary::keyspace_element`.
As `schema::keyspace_name()` is an implementation of one of the methods
specified by that interface, we replace its uses by `schema::ks_name()`.
`schema::keyspace_name()` was an alias for it, so no semantic change
has occured.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
1844c71f9a cql3: Refactor description
In these changes, we describe the purpose of the type
and make it reusable for other parts of the code.
That includes ditching the existing constructors,
leaving the formatting of its fields to the user
of the interface.

The removed constructors have been replaced by
free functions so that existing code can still
use them the way it did before.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
05d6794e65 cql3: Move description to dedicated files
We move the declaration of `description` to dedicated files
to be able to create instances of it from other parts of
the code.

`describe_statement.cc` has been functioning as an
intermediary between objects that can be described
and the end user. It will still perform that duty,
but we want to let other modules be able to generate
descriptions on their own, without having to share
an additional layer of abstraction in form of types
inheriting from `data_dictionary::keyspace_element`.
Those types may not perform any other function than
that and thus may be redundant.

Adjusting `description` to its new purpose will happen
in an upcoming commit.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
78ab1ee8b7 test: Add tests for CREATE ROLE WITH SALTED HASH 2024-09-20 14:24:53 +02:00
Dawid Mędrek
47a5469280 cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
We start requiring that the user issuing `CREATE ROLE
WITH SALTED HASH` be a superuser. The rationale for
that is the statement directly modifies a system
tables, circumventing the hashing algorithm.

Additionally, we correct a possible existing problem.
`_options.is_superuser` in `create_role_statement`
may be an empty optional, so dereferencing it
without a prior check could lead to undefined
behavior in the future.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
206fdf2848 auth: Allow for creating roles with SALTED HASH
We introduce a way to create a role with explictly
provided salted hash.

The algorithm for creating a role with a password works
like this:

1. The user issues a statement `CREATE ROLE <role> WITH
   PASSWORD = '<password>' <...>`.
2. Scylla produces a hash based on the value of
   `<password>`.
3. Scylla puts the produced hash in `system.roles`,
   in the column `salted_hash`.

The newly introduced way to create a role is based
on a new form of the create statement:
`CREATE ROLE <role> WITH SALTED HASH = '<salted_hash>`

The difference in the algorithm used for processing
this statement is that we insert `<salted_hash>`
into `system.roles` directly, without hashing it.

The rationale for introducing this new statement is that
we want to be able to restore roles. The original password
isn't stored anywhere in the database (as intended),
so we need to rely on the column `salted_hash`.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
35a92d189e types: Introduce a function cql3_type_name_without_frozen()
The introduced function returns the actual name
of the type represented by `abstract_type`.
It circumvents name processing like wrapping a type
within `frozen<>` or using Cassandra's syntax.

We add the function to be able to describe UDFs
in the upcoming commits that require that their
arguments not be `frozen<>`.

We also test the implementation.
2024-09-20 14:24:53 +02:00
Dawid Mędrek
202d866892 cql3/util: Accept std::string_view rather than const sstring& 2024-09-20 14:24:53 +02:00
Avi Kivity
61d19e4464 Update tools/java submodule
* tools/java 0b4accdd5e...e505a6d3bb (1):
  > [C-S] Make it use DCAwareRoundRobinPolicy unless rack is provided
2024-09-20 14:49:21 +03:00
Pavel Emelyanov
b45891acd7 sstables: storage: Don't keep base directory in base class
This reverts commit 44bd183187 and
moves the base directory back on filesystem_storage. The mentioned
commit says

>     so we can use the base (table) directory for
>    e.g. pending_delete logs, in the next patch.

but "next patch" doesn't use it outside of the filesystem-storage
anyway.

This field doesn't make sense for S3 backend. Its "location" is not
location, but a key in the system.sstables, which should rather be
schema ID, not /var/lib/.../keyspace/table-uuid string.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20642
2024-09-20 11:51:04 +03:00
Andrei Chekun
bd9a73c39b Add .idea folder to .gitignore
.idea directory used by JetBrains IDE's to store data about project
config

Closes scylladb/scylladb#20718
2024-09-20 11:49:41 +03:00
Tomasz Grabiec
8e047e8fff gdb: Add std::set wrapper
Allows accessing std::set fields from gdb, e.g.:

(gdb) python for e in std_set(_promoted_index._blocks): print(e)

Closes scylladb/scylladb#20650
2024-09-20 08:24:15 +03:00
Anna Stuchlik
5da7894f70 doc: move the install-jmx instructions to a common folder
This commit moves the install-jmx.rst file from the install-scylla folder
to the installation-common folder.

All the references to the moved document are updated.

This is a follow-up to https://github.com/scylladb/scylladb/pull/17969/

Closes scylladb/scylladb#20712
2024-09-20 00:36:32 +03:00
Nadav Har'El
3499c407f7 test: avoid silly "no_mode.1" labels when running tests outside test.py
For the benefit of running test.py inside CI, we recently added to
test/cql-pytest and test/alternator the knowledge of which "Scylla mode"
(--mode) and "run number" is running (--run_id), although these concepts
are alien to these two test frameworks (remember that those test frameworks
can also run tests against unknown versions of Scylla or even our competitors'
implementations).

One unfortunate result of this change is that now if you run a test by
using pytest directly (or test/*/run) instead of test.py, for example:

    $ cd test/alternator
    $ pytest --aws test_item.py::test_basic_string_put_and_get

The test's success or failure reports the ugly name

    test_item.py::test_basic_string_put_and_get.no_mode.1

This unnecessary "no_mode.1" come from the the default values for --mode
and --run_id, respectively. But there is no reason for these silly
defaults. In this patch we change these defaults to None, and when they
are None, they aren't tacked onto the test's name.

This patch shouldn't affect running tests through test.py, because
test.py always sets the --mode and --run_id options, and doesn't leave
them as the default.

Fixes #20512

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20513
2024-09-20 00:36:32 +03:00
Avi Kivity
b015c85d31 Merge 'gms: inet_address: drop unused raw_addr method and modernize comperators' from Benny Halevy
Drop the unused `gms::inet_address::raw_addr` method
and modernize operator== and operator< as class methods

* Cleanup only, no backport needed

Closes scylladb/scylladb#20681

* github.com:scylladb/scylladb:
  gms: inet_address: modernize comparison operators
  gms: inet_address: drop unused raw_addr method
2024-09-20 00:36:32 +03:00
Piotr Dulikowski
7e7701d436 Merge 'cql3/statements/select_statement: SELECT ... USING SERVICE LEVEL' from Michał Jadwiszczak
Allow to specify service level used in select statement `SELECT ... USING SERVICE LEVEL sl_name`.
In OSS, this only affects statement's timeout.

In case both service level and timeout are specified `SELECT ... USING SERVICE LEVEL sl_name AND TIMEOUT 1h`, the timeout has higher priority as statement's timeout.

Fixes scylladb/scylladb#18471

Closes scylladb/scylladb#20523

* github.com:scylladb/scylladb:
  test/cql-pytest: add test for `SELECT ... USING SERVICE LEVEL`
  cql3/Cql.g: extend grammar to allow `SELECT ... USING SERVICE LEVEL`
  cql3/statements/select_statement: use service level timeout
  cql3/attributes: add service level name field
  qos/service_level_controller: add method to check if service level exists in cache
2024-09-19 18:19:23 +02:00
Pavel Emelyanov
bd720dd2da Merge 'cql3: statement_restrictions: adapt to functional style' from Avi Kivity
The statement_restrictions class started life in the object-oriented style - an
object that interacts with its environment via mutators and is observed via
observers.

This is however not suitable for its objective: to analyze the WHERE clause,
select a query plan, and partition the WHERE clause atoms to the various
parts demanded by the query plan (read_command and filters). Furthermore,
the object oriented style makes it hard to work with as you can only call some
observers after the related mutators were called.

Fix this by transforming the code info a more functional style: we call
a function that returns an immutable statement_restrictions object that
can only be observed. This makes it easier to further change in the future,
as changes will not have to consider interaction with the environment.

No backport as this is a refactoring

Closes scylladb/scylladb#20672

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: use functional style
  cql3: statement_restrictions: calculate the index only once
  cql3: statement_restrictions: make it a const object
2024-09-19 18:18:28 +03:00
Kefu Chai
8cc9d783a0 sstables/sstable_directory: document components_lister::process()
for better maintainability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20693
2024-09-19 18:11:31 +03:00
Kefu Chai
7985aa97b1 main, test: use seastar::handle_signal() instead
use `seastar::handle_signal()` instead of `reactor::handle_signal()`.

in a recent change in seastar (c3e826ad1197f2610138f3bcfaeb0b458f8fb799),
the later was marked as deprecated in favor of the former, so let's
use the recommended API.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20695
2024-09-19 18:10:07 +03:00
Kefu Chai
1fd1698a90 test: btree: use BOOST_DATA_TEST_CASE() when appropriate
instead grouping tests with different parameters, let's parameterize
them using `BOOST_DATA_TEST_CASE()`, simpler this way. and the tests
can be more structured.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20697
2024-09-19 18:09:05 +03:00
Avi Kivity
6f7c2ce0aa Merge 'cql_server::connection: Process rebounce message in case of multiple shard migrations' from Sergey Zolotukhin
During a query execution, the query can be re-bounced to another shard if the requested data is located there. Previous implementation assumed that the shard cannot be changed after first re-bounce, however with the introduction of Tablets, data could be migrated to another shard after the query was already re-bounced, causing a failure of the query execution. To avoid this issue, the query is re-bounced as needed until it is executed on the correct shard.

Fixes #15465

Closes scylladb/scylladb#20493

* github.com:scylladb/scylladb:
  cql_server: Add a test for multiple query msg rebounces.
  cql_server::connection: process: rebounce msg if needed
  cql_server::connection: process: co-routinize connection::process_on_shard
  cql_server: connection: process: fixup indentation
  cql_server: connection: process_on_shard: drop permit parameter
  transport: server: pass bounce_to_shard as foreign shared ptr
  cql_server: connection: process: add template concept for process_fn
  cql_server: move process_fn_return_type to class definition
2024-09-19 17:27:55 +03:00
Gleb Natapov
1b4c255ffd test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts 2024-09-19 15:24:59 +03:00
Gleb Natapov
c0939d86f9 topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
2024-09-19 15:23:48 +03:00
Gleb Natapov
644e7a2012 topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.
2024-09-19 15:00:27 +03:00
Benny Halevy
574a08ed96 storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20375
2024-09-19 14:25:46 +03:00
Pavel Emelyanov
8487f2fd93 treewide: Remove table::config::datadir
It's write-only now, all the places than wanted to know where table's
storage is, already use storage_options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
350f64c38b distributed_loader: Print storage options, not datadir
When populating keyspace on boot the dist. loader prints a debugging
message with ks:cf names, state and the directory from where it picks
sstables. The last one is not extremely correct, as loading sstables
from S3 happens from a bucket, not directory. So it's better to print
the storage options, not the datadir string.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
b2fcfdcaa9 data_dictionary: Add formatter for storage_options
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
5046cfab4b test: Construct table_for_tests with table storage options
The only place that constructs table_for_tests is make_table_for_tests
helper. It can and should prepare the correct storage options, because
that's the last place where the target directory is still known.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
eaad4f348b test: Generalize pair of make_table_for_tests helpers
They only differ in a way they get target directory from -- one via
argument, andother from test_env. Respectively, the latter can call the
former.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
d9ef9bdd3b tests: Add helper to get snapshot directory from storage options
There's a bunch of tests that check the contents of snapshot directory
after creating one. Add a helper for those that gets this directory via
storage options, not table config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:39 +03:00
Pavel Emelyanov
a734fd5c9c table: snapshot_exists: Get directory from storage options
Similarly to snapshot_on_all_shards, the way snapshot directory is
evaluated is changed to rely on storage options. Two ... assumptions are
that when asking for non-local snapshot existance or for a snapshot of a
virtual table, it's correct to return false instead of throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:06:09 +03:00
Pavel Emelyanov
24589cf00c table: snapshot_on_all_shards: Get directory from storage options
There are several things that are changed here

- The target directory for snapshot is evaluated using table directory
  taken from its storage options, not from config

- If the storage options are not "local", the snapshot_on_all_shards is
  failed early, it's impossible to snapshot sstables anyway

- If the storage is not configured for the obtained local options,
  snapshotting is skilled, because it's a virtual table that's probably
  not supposed to have snapshots

- The late failure to snapshot non-local sstables is converted into
  internal error, as this functionality cannot be executed as per
  previous change

- The target path is created using fs::path operator/ overload, not by
  concatenating strings (it's minor change)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-19 13:05:16 +03:00
Anna Stuchlik
cdc69b4e06 doc: enable publishing docs for branch-6.2
This commit enables publishing documentation from branch-6.2. The docs will be published as UNSTABLE (the warning about version 6.1 being unstable will be displayed).

Fixes https://github.com/scylladb/scylladb/issues/20643

No backport is required.

Closes scylladb/scylladb#20647
2024-09-19 09:39:58 +03:00
Anna Stuchlik
400a14eefa doc: update the unified installer instructions
This commit updates the unified installer instructions to avoid specifying a given version.
At the moment, we're technically unable to use variables in URLs, so we need to update
the page each release.

Fixes https://github.com/scylladb/scylladb/issues/20677

Closes scylladb/scylladb#20680
2024-09-19 09:28:44 +03:00
Anna Stuchlik
aa0c95c95c doc: fix a broken link
This commit fixes a link to the Manager by adding a missing underscore
to the external link.

Closes scylladb/scylladb#20656
2024-09-19 09:20:20 +03:00
Calle Wilund
60f8a9f39d database: Also forced new schema commitlog segment on user initiated memtable flush
Refs #20686
Refs #15607

In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.

Closes scylladb/scylladb#20691
2024-09-19 09:00:33 +03:00
Benny Halevy
5ccdf1cf1c gms: inet_address: modernize comparison operators
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-18 17:07:51 +03:00
Benny Halevy
38540d89a1 gms: inet_address: drop unused raw_addr method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-18 14:21:18 +03:00
Kefu Chai
b0696bd842 test: btree: use BOOST_DATA_TEST_CASE to structure parameterized tests
for better readability. and for more structured tests.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20516
2024-09-18 14:16:28 +03:00
Pavel Emelyanov
eb22c2a8c8 Merge 'reader_concurrency_semaphore: improve the diagnostics dump' from Botond Dénes
* Also dump diagnostics when a read times out while active (not queued).
* Add the "Trigger permit" line, containing the details of the permit which caused the diagnostics dump (by e.g. timing out).
* Add the "Identified bottleneck(s)" line, containing the identified bottlenecks which lead to permits being queued. This line is missing if no such bottleneck can be identified.
* Document the new features, as well as the stat dump, which was added some time ago.

Example of the new dump format:
```
INFO  2024-09-12 08:09:48,046 [shard  0:main] reader_concurrency_semaphore - Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/10 count and 106192275/32768 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=0, memory=0, table=ks.tbl0, operation=mutation-query, state=waiting_for_admission
Identified bottleneck(s): memory

permits count   memory  table/operation/state
3       2       26M     *.*/push-view-updates-2/active
3       2       16M     ks.tbl1/push-view-updates-1/active
1       1       15M     ks.tbl2/push-view-updates-1/active
1       0       13M     ks.tbl1/multishard-mutation-query/active
1       0       12M     ks.tbl0/push-view-updates-1/active
1       1       10M     ks.tbl3/push-view-updates-2/active
1       1       6060K   ks.tbl3/multishard-mutation-query/active
2       1       1930K   ks.tbl0/push-view-updates-2/active
1       0       1216K   ks.tbl0/multishard-mutation-query/active
6       0       0B      ks.tbl1/shard-reader/waiting_for_admission
3       0       0B      *.*/data-query/waiting_for_admission
9       0       0B      ks.tbl0/mutation-query/waiting_for_admission
2       0       0B      ks.tbl2/shard-reader/waiting_for_admission
4       0       0B      ks.tbl0/shard-reader/waiting_for_admission
9       0       0B      ks.tbl0/data-query/waiting_for_admission
7       0       0B      ks.tbl3/mutation-query/waiting_for_admission
5       0       0B      ks.tbl1/mutation-query/waiting_for_admission
2       0       0B      ks.tbl2/mutation-query/waiting_for_admission
8       0       0B      ks.tbl1/data-query/waiting_for_admission
1       0       0B      *.*/mutation-query/waiting_for_admission
26      0       0B      permits omitted for brevity

96      8       101M    total

Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 1
reads_enqueued_for_admission: 82
reads_enqueued_for_memory: 0
reads_admitted_immediately: 1
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 82
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 97
current_permits: 96
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
```

Fixes: https://github.com/scylladb/scylladb/issues/19535

Improvement, no backport needed.

Closes scylladb/scylladb#20545

* github.com:scylladb/scylladb:
  docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps
  test/boost/reader_concurrency_semaphore_test: test the new diagnostics functionality
  reader_concurrency_semaphore: add bottleneck self-diagnosis to diagnosis dump
  reader_concurrency_semaphore: include trigger permit in diagnostic dump
  reader_concurrency_semaphore: propagate permit to do_dump_reader_permit_diagnostics()
  reader_concurrency_semaphore: use consistent exception type for timeout
  reader_concurrency_semaphore: dump diagnostics when non-waiting reader times out
2024-09-18 14:06:05 +03:00
Botond Dénes
1efda557b1 replica/table: query_mutations(): enter the table's async gate
So the table is not dropped while the query is ongoing.
query() already does this but using old-fashioned enter()+leave(),
convert it to use the new RAII helper.

Closes scylladb/scylladb#20583
2024-09-18 14:03:22 +03:00
Pavel Emelyanov
2f4f0eb060 Merge 'Alternator: a few RBAC fixes' from Nadav Har'El
The main goal of this PR is to fix a bug (#20619) in the alternator_enforce_authorization=false setting - which didn't do its job (i.e, _don't_ check permissions) when authorization is configured in CQL but not wanted in Alternator.

The series also a few smaller bugs in the code that were discovered while debugging the main issue:
1. A potential use-after-free (that didn't seem to hit us in practice) is fixed.
2. A confusing error message (that was also reported in #20619) is improved.
3. Make the alternator_enforce_authorization live-updatable. There was no reason why it shouldn't be, and as this series needs to make this flag available to more code, let's just do it properly and assume the flag is live-updatable.

Because the RBAC feature has not been backported to any open-source branches, neither should these fixes. But if some private branch received a backport of the RBAC feature, it should get these fixes too.

Fixes #20619.

Closes scylladb/scylladb#20640

* github.com:scylladb/scylladb:
  alternator: make alternator_enforce_authorization live-updateable
  alternator: fix alternator_enforce_authorization=false
  alternator: improve error message when unauthenticated
  alternator: avoid use-after-free in RBAC
2024-09-18 14:02:09 +03:00
Kefu Chai
cb1670b79b Update seastar submodule
* seastar ec5da7a6...69f88e2f (38):
  > build: s/Sanitizers_COMPILER_OPTIONS/Sanitizers_COMPILE_OPTIONS
  > test: Update httpd test with request/reply body writing sugar
  > http: Add sugar to request and response body writers
  > utils: Add util::write_to_stream() helper
  > seastar-addr2line: adjust llvm termination regex
  > README.md: add Crimson project
  > rpc: conditionally use fmt::runtime() based on SEASTAR_LOGGER_COMPILE_TIME_FMT
  > build: check the combination of Sanitizers
  > tls: clear session ticket before releasing
  > print: remove dead code
  > doc/lambda-coroutine-fiasco: reword for better readability
  > rpc: fix compilation error caused by fmt::runtime()
  > tutorial: explain the use case of rethrow_exception and coroutine::exception
  > reactor: print more informative error when io_submit fails
  > README.md: note GitHub discussions
  > prometheus: `fmt::print` to stringstream directly
  > doc: add document for testing with seastar
  > seastar/testing: only include used headers
  > test: Add abortable http client test cases
  > http/client: Add abortable make_request() API method
  > http/client: Abort established connections
  > http/client: Handle abort source in pool wait
  > http/client: Add abort source to factory::make() method
  > http/client: Pass abort_source here and there
  > http/client: Idnentation fix after previous patch
  > http/client: Merge some continuations explicitly
  > signal: add seastar signal api
  > httpd: remove unused prometheus structs
  > print: use fmtlib's fmt::format_string in format()
  > rpc: do not use seastar::format() in rpc logger
  > treewide: s/format/seastar::format/
  > prometheus: sanitize label value for text protocol
  > tests: unit test prometheus wire format
  > io-tester: Introduce batches to rate-based submission
  > io-tester: Generalize issueing request and collecting its result
  > io-tester: Cancel intent once
  > io-tester: Dont carry rps/parallelism variables over lambdas
  > io-tester: Simplify in-flight management

The breaking changes in the seastar submodule necessitate corresponding
modifications in our code. These changes must be implemented together in
a single commit to maintain consistency. So that each commit is buildable.

following changes are included in addition to seastar submodule update:
* instead of passing a `const char*` for the format string, pass a
  templated `fmt::format_string<...>`, this depends on the
  `seastar::format()` change in seastar.
* explicitly call `fmt::runtime()` if the format string is not a
  consteval expression. this depends on the `seastar::format()` change
  in seastar. as `seastar::format()` does not accept a plain
  `const char*` which is not constexpr anymore.
* pass abort_source to `dns_connection_factory::make()`. this depends on
  the change in seastar, which added a `abort_source*` argument to
  the pure virtual member function of `connection_factory::make()`.
* call call {fmt,seastar}::format() explicitly. this is a follow up of
  3e84d43f, which takes care of all places where we should call
  `fmt::format()` and `seastar::format()` explicitly to disambiguate the
  `format()` call. but more `format()` call made their way into the source
   tree after 3e84d43f. so we need fix them as well.
* include used header in tests

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Update seastar submodule

 Please enter the commit message for your changes. Lines starting

Closes scylladb/scylladb#20649
2024-09-18 13:59:22 +03:00
Gleb Natapov
bddaf498df group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.
2024-09-18 13:42:38 +03:00
Amnon Heiman
8dec292698 alternator:test_metrics test metrics for batch item count
This patch adds tests for the batch operations item count.

The tests validate that the metrics tracking the number of items
processed in a batch increase by the correct amount.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-18 11:31:06 +03:00
Amnon Heiman
4d57a43815 alternator:test_metrics Add validating the increased value
The `check_increases_operation` now allows override the checked metric.

Additionally, a custom validation value can now be passed, which make it
possible to validate the amount by which a value has changed, rather
than just validating that the value increased.

The default behavior of validating that values have increased remains
unchanged, ensuring backward compatibility.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-18 11:31:06 +03:00
Amnon Heiman
905408f764 alternator: Fix item counting in batch operations
This patch fixes the logic for counting items in batch operations.
Previously, the item count in requests was inaccurate, it count the
number of tabels in get_item and the request_items in write_items.

The new logic correctly counts each individual item in `BatchGetItem`
and `BatchWriteItem` requests.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-18 11:30:59 +03:00
Amnon Heiman
515857a4a9 Alterntor rename batch item count metrics
This patch renames metrics tracking the total number of items in a batch
to `scylla_alternator_batch_item_count`.  It uses the existing `op` label to
differentiate between `BatchGetItem` and `BatchWriteItem` operations.

Ensures better clarity and distinction for batch operations in monitoring.

This an example of how it looks like:
 # HELP scylla_alternator_batch_item_count The total number of items processed across all batches
 # TYPE scylla_alternator_batch_item_count counter
 scylla_alternator_batch_item_count{op="BatchGetItem",shard="0"} 4
 scylla_alternator_batch_item_count{op="BatchWriteItem",shard="0"} 4
2024-09-18 11:20:07 +03:00
Anna Mikhlin
0c7ca284ad mergify: add support for branch-6.2
branch-6.2 is already available, adding support for it in mergify to
allow backport to this new branch.
in addition, since branch 5.4 reached EOL - removing it

Closes scylladb/scylladb#20669
2024-09-18 08:30:41 +03:00
Ernest Zaslavsky
924325fd25 treewide: add "prefix" parameter to backup API
Allow the caller to pass the prefix when performing backup and restore

Fixes scylladb/scylladb#20335

Closes scylladb/scylladb#20413
2024-09-18 08:25:00 +03:00
Calle Wilund
b789361091 commitlog: Fix assertion in oversized_alloc
Fixes #20633

Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)

* v2 - use SCYLLA_ASSERT

Closes scylladb/scylladb#20654
2024-09-18 08:22:28 +03:00
Avi Kivity
57ab5ce313 repair: row_level: simplify repair_put_row_diff_with_rpc_stream_process_op()
repair_put_row_diff_with_rpc_stream_process_op() always returns
stop_iteration::no (or throws). Moreover, the return value is ignored
by its only caller. Simplify by returning a plain future<>.

Closes scylladb/scylladb#20610
2024-09-18 08:17:09 +03:00
Botond Dénes
d72fcb11f5 Merge 'Add new GDB commands to dump sstable index file from memory and print promoted index ' from Tomasz Grabiec
Closes scylladb/scylladb#20648

* github.com:scylladb/scylladb:
  gdb: Introduce "scylla sstable-dump-cached-index" command
  gdb: Introduce "scylla sstable-promoted-index" command
  gdb: Fix range printer for singular ranges
2024-09-18 08:13:04 +03:00
Nadav Har'El
24fb92c8ba Merge 'cql3: simplify runtime component of selection filtering' from Avi Kivity
Most of the analysis of the WHERE clause is done in statement_restrictions. It determines
what parts to use for the primary or secondary index, and what parts to use for filtering.

The difficult part is that it has a very wide interface. After construction, the user must pick
the correct bits from many public functions. There are subtle interactions between them
that are hard to untangle.

This series simplifies the interface as it is used for selection filtering. In the end, only
two public functions are used, both returning expressions: one for the partition-level
filtering, one for the clustering row level filtering.

In the end, the WHERE clause is factored into three parts:
 - one part goes into the read_command of the primary or secondary index
 - another part (that references only partition key columns and static key columns) is used to filter entire partitions
 - another part (that currently references only clustering key columns and regular columns, but one day may reference other columns) is used to filter clustering rows

Refactoring, no backport.

Closes scylladb/scylladb#20487

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: drop accessors for single-column key restrictions
  cql3: selection: adjust indentation
  cql3: selection: delete empty loop
  cql3: statement_restrictions, selection: fold multi-column restrictions into row-level filter
  cql3: statement_restrictions, selection: merge clustering key filter and regular columns filter
  cql3: statement_restrictions, selection: merge partition key filter and static columns filter
  cql3: selection: filter regular and static rows as a single expression each
  cql3: statement_restrictions: collect regular column and static column filters into single expressions
  cql3: selection: filter clustering key as a single expression
  cql3: statement_restrictions: expose filter for clustering key
  cql3: selection: filter partition key as a single expression
  cql3: statement_restrictions: expose filter for partition key
  cql3: statement_restrictions: remove relations used for indexing from filtering
  cql3: statement_restrictions: bail out of find_idx if !_uses_secondary_index
  cql3: statement_restrictions, modification_statement: pass correct value of check_indexes
  cql3: statement_restrictions: correct mismatched clustering/partition restrictions references
  cql3: statement_restrictions: precalculate get_column_defs_for_filtering()
  cql3: selection: do_filter(): push static/regular row glue to higher level
2024-09-17 22:58:24 +03:00
Piotr Dulikowski
cc5c3aaae7 Merge 'message/messaging_service: guard adding maintenance tenant under cluster feature' from Michał Jadwiszczak
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant `$maintenance`, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.

This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:
- version without `$maintenance` tenant -> version with `$maintenance` tenant guarded by a feature
- version with `$maintenance` tenant but not guarded by a feature -> version with `$maintenance` tenant guarded by a feature

The PR adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The `$maintenance` tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.

Fixes scylladb/scylladb#20070
Refs scylladb/scylla-enterprise#4403

Closes scylladb/scylladb#19802

* github.com:scylladb/scylladb:
  message/messaging_service: guard adding maintenance tenant under cluster feature
  message/messaging_service: add feature_service dependency
  message/messaging_service: add `enabled` flag to statement tenants
2024-09-17 18:24:34 +02:00
Avi Kivity
1663fbe717 cql3: statement_restrictions: use functional style
Instead of a constructor, use a new function
analyze_statement_restrictions() as the entry point. It returns an
immutable statement_restrictions object.

This opens the door to returning a variant, with each arm of the variant
corresponding to a different query plan.
2024-09-17 17:13:27 +03:00
Avi Kivity
3169b8e0ec cql3: statement_restrictions: calculate the index only once
find_idx() is called several times. Rename it do_find_idx(), call it
just once, store the results, and make find_idx() return the stored
results.

This simplifies control flow and reduces the risk that successive
calls of find_idx return different results.
2024-09-17 17:03:31 +03:00
Avi Kivity
d5c8083b76 cql3: statement_restrictions: make it a const object
Make validate_secondary_index_selections() const (it trivially is),
and call prepare_indexed_local() / prepared_indexed_global() at the
end of the constructor.

By making statement_restrictions a const object, reasoning about it
can be local (looking at the source file) rather than global (looking
at all the interactions of the class with its environment. In fact,
we might make it a function one day.

Since prepare_indexed_global()/prepare_indexed_local() only mutate
_idx_tbl_ck_prefix, which isn't mutated by the rest of the code, the
transformation is safe.

The corresponding code is removed from select_statement. The removal
isn't complete since it still uses some computation, but later
deduplication is left for another day.
2024-09-17 17:03:27 +03:00
Sergey Zolotukhin
68740f57c2 cql_server: Add a test for multiple query msg rebounces.
The test emulates several LWT(Lightweight Transaction) query rebounces. Currently, the code
that processes queries does not expect that a query may be rebounced more than once.
It was impossible with the VNodes, but with intruduction of the Tablets, data can be moved
between shards by the balancer thus a query can be rebounced to different shards multiple times.
2024-09-17 15:19:56 +02:00
Benny Halevy
65430b9e1b cql_server::connection: process: rebounce msg if needed
Rebounce the msg to another shard if needed,
e.g. in the case of tablet migration.

An example for that, as given by Tomasz Grabiec:
> Bouncing happens when executing LWT statement in
> modification_statement::execute_with_condition by returning a
> special result message kind. The code assumes that after
> jumping to the shard from the bounce request, the result
> message is the regular one and not yet another bounce.
> There is no problem with vnodes, because shards don't change.
> With tablets, they can change at run time on migration.

Fixes scylladb/scylladb#15465

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-17 15:09:43 +02:00
Sergey Zolotukhin
f674f522aa cql_server::connection: process: co-routinize connection::process_on_shard
`cql_server::connection::process_on_shard` is made a co-routine to
make sure captured objects' lifetime is managed by the source shard,
avoiding error prone inter-shard objects transfers.
2024-09-17 14:54:42 +02:00
Nadav Har'El
17deaae463 alternator: make alternator_enforce_authorization live-updateable
For no good reason, the "alternator_enforce_authorization" flag (which
chooses whether to enable authentication and authorization checks in
Alternator) was not live-updatable, so make it so.

Both "server" and "executor" objects use this configuration flag, the
former is fixed in this patch (to hold a live-updatable reference
instead of a copy of a boolean), the latter was already prepared for
this change and already held a live-updatable reference.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-17 15:51:16 +03:00
Nadav Har'El
00793059e1 alternator: fix alternator_enforce_authorization=false
When the configuration has alternator_enforce_authorization=false,
Alternator should not do authentication (check which user signed each
request) nor authorization (check if that user has permissions to do
each operation).

Our implementation forgot to disable the authorization checks when
it's configured to false. The (incorrect) assumption was that when
alternator_enforce_authorization is configured to false, the CQL
'authenticator' and 'authorizer' configuration is also disabled -
so the authorization checks will be no-ops. But we can't assume
that: Users are free to configure 'authenticator' and 'authorizer'
for use in CQL, and then set alternator_enforce_authorization=false
just for Alternator.

So this patch adds a new test for this case - when we have
authenticator=PasswordAuthenticator, authorizer=CassandraAuthorizer
but alternator_enforce_authorization=false, and fixes it to work
correctly.

The heart of the fix is trivial: the `verify_*_permission()` functions
just need to check the alternator_enforce_authorization and return
immediately when false. The bigger part of this change is to get the
alternator_enforce_authorization into the "executor" object and then
to pass it into the verify calls.

Although alternator_enforce_authorization is not YET live updatable,
this code is prepared for the future that it may become live
updatable, so the executor object saves not the boolean value of
this flag, but a live-updatable reference to it.

Fixes #20619

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-17 15:50:00 +03:00
Nadav Har'El
76af7c0389 alternator: improve error message when unauthenticated
When access-control checks report permission denied, we want to report
the name of the authenticated role (the role signing the request) which
didn't have the permission. When authentication was disabled, and there
is no authenticated role, we printed the fake name "anonymous", but this
can confuse users (it confused me!) to think there's an actual role
named "anonymous". So let's change that string to "<anonymous>" with
angle brackets - it makes it more obvious that this isn't a real role,
but actually an anonymous request.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-17 15:44:29 +03:00
Tomasz Grabiec
e70ce4d6ed gdb: Introduce "scylla sstable-dump-cached-index" command 2024-09-17 14:41:18 +02:00
Tomasz Grabiec
9f0eed263d gdb: Introduce "scylla sstable-promoted-index" command 2024-09-17 14:41:13 +02:00
Nadav Har'El
3543bf14e9 alternator: avoid use-after-free in RBAC
While auditing the code, I noticed that the current Alternator access
control checks have code like:

```
    return client_state.check_has_permission(auth::command_desc(
            permission_to_check,
            auth::make_data_resource(schema->ks_name(), schema->cf_name()))).then(
```

There's a problem here - it turns out that, unfortunately, command_desc
holds a reference to the "resource" object - not a copy. So the temporary
object returned by make_data_resource may be freed and then used...
Curiously, we've not seen a bug caused by this in practice (not even in
debug build mode), but better safe than sorry, so this patch changes the
code in one of two ways:

1. Code using coroutines can keep the "resource" as a variable on the
   stack.
2. Code using continuations needs to hold the "resource" with do_with(),
   but since this already incurs the cost of an extra allocation
   (even in the successful case), might as well just switch to using
   coroutines and have less ugly code.

This patch does not change any functionality, and all the tests seem to
work before and after it the same.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

hello
2024-09-17 15:41:09 +03:00
Tomasz Grabiec
2c463ead59 gdb: Fix range printer for singular ranges
Before, it printed [x, +inf) instead of {x}
2024-09-17 14:30:28 +02:00
Andrei Chekun
bbb6c3c2ff test.py: Add resource consumption metrics
This PR adds the possibility to gather resource consumption metrics. The collected metrics can be used to compare performance before and after specific changes aimed at increasing performance. Currently, this functionality works only in manual mode, and this is just raw data. Later on, these metrics can be used in Jupyter notebook to analyze and visualize how the resources are used and can provide the insight on how to improve it. This PR is a first insight after gathering these metrics.

Add the possibility to gather resource consumption for the test.py execution. SQLite DB will be created with different performance metrics that will allow comparing the resource consumption between changes.
The DB will be in the tmp directory that by default set to testlog. Across the runs, the DB will not be deleted, so each new run will just add information to the existing DB.
Parameter --get-metrics was added to switch on or off the metrics gathering. By default, it's switched on.

Closes: scylladb/qa-tasks#1666

Closes: scylladb/qa-tasks#1707

Closes scylladb/scylladb#19881
2024-09-17 15:22:34 +03:00
Benny Halevy
39ce358d82 time_window_compaction_strategy: get_reshaping_job: restrict sort of multi_window vector to its size
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.

Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.

Fixes scylladb/scylladb#20608

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20609
2024-09-17 15:05:37 +03:00
Tomasz Grabiec
adf99402c5 Merge 'readers/flat_mutation_reader_v2: call set_close_required() from consume*()' from Botond Dénes
The `consume*()` variants just forward the call to the `_impl` method with the same name. The latter, being a member of `::impl`, will bypass the top level `fill_buffer()`, etc. methods and thus will never call `set_close_required()`. Do this in the top-level `consume*()` methods instead, to ensure a reader, on which only `consume*()` is called, and then is destroyed, will complain as it should (and abort).
Only one place was found in core code, which didn't close the reader: `split_mutation() in `mutation/mutation.cc` and this reader is the "from-mutation" one which has no real close routine. All other places were in tests. All this is to say, there were no real bugs uncovered by this PR.

Fixes #16520

Improvement, no backport required.

Closes scylladb/scylladb#16522

* github.com:scylladb/scylladb:
  readers/flat_mutation_reader_v2: call set_close_required() from consume*()
  test/boost/sstable_compaction_test: close reader after use
  test/boost/repair_test: close reader after use
  mutation/mutation: split_mutation(): close reader after use
2024-09-17 13:21:34 +02:00
Anna Mikhlin
66c0814c33 Update ScyllaDB version to: 6.3.0-dev 2024-09-17 13:43:04 +03:00
Botond Dénes
6250ff18eb Merge 'sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader' from Kefu Chai
"crawling" is a little bit obscure in this context. so let's rename this class to reflect the fact that this reader only reads the entire content of the sstable.

both crawling reader for kl and mx formats are renamed. also, in order to be consistent, all "crawling reader" in variable names are updated as well.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#20599

* github.com:scylladb/scylladb:
  sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader
  sstable/mx/reader: add comment for mx_crawling_sstable_mutation_reader
2024-09-17 11:55:08 +03:00
Pavel Emelyanov
ebfa73e004 s3/client: Don't move file from write_body's lambda
Requests sent by S3 are retriable, so when request.write_body() is
called, it should keep everything intact in case http client will call
it again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20579
2024-09-17 09:48:09 +03:00
Tzach Livyatan
cb864b11d8 Update client-node-encryption: OpsnSSL is FIPS *enabled*
Closes scylladb/scylladb#19705
2024-09-17 09:47:07 +03:00
Botond Dénes
f32e67cb9e Merge 'Make sstables without on-disk path' from Pavel Emelyanov
New sstables for a table are created by the table::make_sstable() method. The method then calls sstables_manager::make_sstable() and passes there a path to component files which, in turn, sits on table::config. Since some time ago having an on-disk path for an sstable had become optional, as sstables could be put on S3 storage without local paths involved. In that case the aforementioned "path" is ~~ab~~used as a key in the system.sstables registry, that references a record with information used to retrieve URLs of sstables' objects.

This PR removes the "path" argument from sstables_manager::make_sstable() and its sstable_sdirectory peer. The details of sstables' location are moved onto storage_options and depend on storage type. For now in both storage types this location is still the good-old $datadir/$keyspace/$table-$uuid string. S3 storage needs to be patched more to use more elegant "location" value.

Eventually the `table::config::{datadir|all_datadirs}` will be removed, this PR is the step towards it.

closes: #12707

Closes scylladb/scylladb#20542

* github.com:scylladb/scylladb:
  table: Use storage options to clean the storage
  sstables/storage: Re-use ocally generated vector of paths
  sstables/storage: Visit options once to initialize storage
  sstables_manager: Return table storage options when initalizing storage
  sstables/storage: Fix indentation after previous patch
  table: Move datadirs initialization parallelism to storage level
  sstables/storage: Split the visitor's overloaded functor
  restore: Don't use table_dir to construct sstable_directory
  sstable_directory: Remove table_dir field
  sstable_directory: Use options details in lister
  sstables_manager: Remove table_dir from make_sstable()
  sstables: Remove table_dir from sstable constructor
  sstables/storage: Remove sstring dir from make_storage()
  sstables/storage: Use options to construct
  tests: Properly initialize storage options with "dir"
  distributed_loader: Create S3 options with prefix for restore
  storage_options: Add special-purpose local options maker
  storage_options: Keep local path / s3 prefix onboard
  table: Get another options when initializing storage
2024-09-17 09:41:21 +03:00
Kefu Chai
df7f332a58 sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader
"crawling" is a little bit obscure in this context. so let's rename this
class to reflect the fact that this reader only reads the entire content
of the sstable.

both crawling reader for kl and mx formats are renamed. also, in order
to be consistent, all "crawling reader" in variable names are updated
as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-17 10:39:37 +08:00
Kefu Chai
c1ed2f0ea4 sstable/mx/reader: add comment for mx_crawling_sstable_mutation_reader
to explain its typical usage.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-17 10:39:25 +08:00
Michał Jadwiszczak
b4b91ca364 message/messaging_service: guard adding maintenance tenant under cluster feature
Set `enabled` flag for `$maintenance` tenant to false and
enable it when `MAINTENANCE_TENANT` feature is enabled.
2024-09-16 15:34:36 +02:00
Michał Jadwiszczak
71a03ef6b0 message/messaging_service: add feature_service dependency 2024-09-16 15:33:40 +02:00
Michał Jadwiszczak
d44844241d message/messaging_service: add enabled flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.

This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.
2024-09-16 15:31:04 +02:00
Michał Jadwiszczak
de7acbad8b test/cql-pytest: add test for SELECT ... USING SERVICE LEVEL 2024-09-16 14:31:43 +02:00
Michał Jadwiszczak
8255c61f5f cql3/Cql.g: extend grammar to allow SELECT ... USING SERVICE LEVEL 2024-09-16 14:31:32 +02:00
Michał Jadwiszczak
af6dc78025 cql3/statements/select_statement: use service level timeout
Use service level timeout in selecte statement when specified.
`USING TIMEOUT` have higher priority in timeout definition.
2024-09-16 13:48:48 +02:00
Michał Jadwiszczak
2e545c915b cql3/attributes: add service level name field
In next patches, we will allow to do `SELECT ... USING SERVICE LEVEL
sl_name`. To do it, we need to extend `cql3::attributes` with service
level name.
2024-09-16 13:48:43 +02:00
Michał Jadwiszczak
b9b326c2bb qos/service_level_controller: add method to check if service level exists in cache
There is `service_level_controller::get_service_level()` method,
which searches for service level in the controller cache and returns
default service level if SL with given name doesn't exist.

Added method allows to check whether a service level exists in the
controller cache.
2024-09-16 12:41:15 +02:00
Pavel Emelyanov
bf5021e735 test: Remove sstables::test::binary_search()
That's the most mysterious wrapper in this set as it doesn't need
sstable itself at all, it just duplicates the existing non-class
function out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:51:35 +03:00
Pavel Emelyanov
309d315af7 test: Remove sstables::test::move_summary()
This one is a bit tricky, as it needs to modify the sstables's summary.
However, the sstables::test::_summary() one returns mutable reference
and the only caller can use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:50:48 +03:00
Pavel Emelyanov
deec952111 test: Remove sstables::test::read_toc()
The sstable::read_toc() is public method, use it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:50:19 +03:00
Pavel Emelyanov
25cd8ccdd8 test: Remove sstables::test::get_summary()
Same as previous patch -- callers can come with const reference to
summary, so they can live with existing public sstable::get_summary().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:49:39 +03:00
Pavel Emelyanov
f714ac9b48 test: Remove sstables::test::get_statistics()
Just call the public sstable::get_statistics(). The callers would get
const reference on it, but they don't need more than that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:48:43 +03:00
Pavel Emelyanov
53afa583e8 test: Remove sstables::test::data_read()
The wrapper just changes the order of arguments for a public method.
Drop it, and call the wrapee directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-16 12:47:59 +03:00
Avi Kivity
e4cab3a5e9 cql3: statement_restrictions: drop accessors for single-column key restrictions
No longer used.
2024-09-16 12:15:14 +03:00
Avi Kivity
626acf416e cql3: selection: adjust indentation 2024-09-16 12:15:14 +03:00
Avi Kivity
c443d922ea cql3: selection: delete empty loop
Our refactoring left a loop with no body, delete it.
2024-09-16 12:15:14 +03:00
Avi Kivity
56e8a4c931 cql3: statement_restrictions, selection: fold multi-column restrictions into row-level filter
When filtering, we apply single-column and multi-column filters separately.

This is completely unnecessary. Find the multi-column filters during prepare
time and append them to the row-level filter.

This slightly changes the original: in the original, if we had a multi-column
filter, we applied all of the restrictions. But hopefully if we check
for multi-column filters, that's what we need.
2024-09-16 12:15:14 +03:00
Avi Kivity
a6d81806c0 cql3: statement_restrictions, selection: merge clustering key filter and regular columns filter
The two filters are used in the same way: check the filter, return false if
it matches.

Unify the two filters into a clustering_row_level_filter.

Since one of the two filters wasn't std::optional, we take the liberty
of making the combined filter non-optional.
2024-09-16 12:15:03 +03:00
Avi Kivity
2933a2f118 cql3: statement_restrictions, selection: merge partition key filter and static columns filter
The two filters are used in the same way: check the filter, set a boolean
flag if it matches, return false. The two boolean flags are in turn checked
in the same way.

Unify the two filters into a partition_level_filter.

Since one of the two filters wasn't std::optional, we take the liberty
of making the combined filter non-optional.
2024-09-16 12:10:49 +03:00
Avi Kivity
807153a9ed cql3: selection: filter regular and static rows as a single expression each
Instead of filtering regular and static columns column by column, call
is_satisfied_by() for an expression containing all the static columns
predicates, and one for all the regular column.

We cannot have one expression, since the code sets
_current_static_row_does_not_match only for static columns.

Note the fix for #20485 is now implicit, since the evaluation machinery
will treat missing regular columns as NULL.
2024-09-15 14:33:57 +03:00
Avi Kivity
3c71096479 cql3: statement_restrictions: collect regular column and static column filters into single expressions
Similar to previous work with clustering and partition key, expose
static and reglar column filters as single expressions.

Since we don't currently expose a boolean for whether those filters
exist, we expose them now as non-optionals. In any case evaluating
an empty conjunction is plenty fast.
2024-09-15 14:33:57 +03:00
Avi Kivity
ec2898afe9 cql3: selection: filter clustering key as a single expression
Instead of filtering the clustering key column by column, call
is_satisfied_by() for an expression containing all the clustering key
predicates.

The check for clustering_key.empty() is removed; the evaluation machinery
is able to handle partial clustering keys. In fact if we add IS NULL,
we have to evaluate as an empty clustering key should match.
2024-09-15 14:33:57 +03:00
Avi Kivity
318d653d80 cql3: statement_restrictions: expose filter for clustering key
cql3::selection performs filtering by consulting
ck_restrictions_need_filtering() and
get_single_column_clustering_key_restrictions() (which is a map of column
definition to expressions). Make them available in one
nice package as an optional<expression>. When the optional is engaged,
filtering is needed, and the expression in the equivalent of all of the
map.
2024-09-15 14:33:57 +03:00
Avi Kivity
0bd2f12922 cql3: selection: filter partition key as a single expression
Instead of filtering the partition key column by column, call
is_satisfied_by() for an expression containing all the partition key
predicates.
2024-09-15 14:33:56 +03:00
Avi Kivity
21cb91077f cql3: statement_restrictions: expose filter for partition key
cql3::selection performs filtering by consulting
pk_restrictions_need_filtering() and
get_single_column_partition_key_restrictions() (which is a map of column
definition to expressions). Make them available in one
nice package as an optional<expression>. When the optional is engaged,
filtering is needed, and the expression in the equivalent of all of the
map.
2024-09-15 14:33:56 +03:00
Avi Kivity
a453221314 cql3: statement_restrictions: remove relations used for indexing from filtering
statement_restrictions does not name columns that were used for a secondary
index for selection for filtering, since accessing the index "pre-filters"
these columns.

However, it keeps the relations that contain these columns. This makes
it impossible (besides unnecessary) to evaluate the relations, as the
columns they reference aren't selected.

The reason this works now is that
result_set_builder::restrictions_filter::do_filter() iterates on selected
columns, matching them to relations, then execute the matched relation.
A relation that references an unselected column is invisible to do_filter().

We wish to filter using complete expressions, rather than fragments, so as
a first step remove these unnecessary and unusable relations while we
choose which columns are necessary for filtering.

calculate_column_defs_for_filtering is renamed to remind us of the extra
work done.
2024-09-15 14:33:56 +03:00
Avi Kivity
ba8c2014bf cql3: statement_restrictions: bail out of find_idx if !_uses_secondary_index
The condition seems trivial, but wasn't implemented, without ill effects
so far.

With the following patches, calculate_column_defs_for_filtering() becomes
confused as it selects an indexing code path even when !_uses_secondary_index,
triggered by the reproducer of #10300.
2024-09-15 14:33:56 +03:00
Avi Kivity
65ba19323c cql3: statement_restrictions, modification_statement: pass correct value of check_indexes
Our UPDATE/INSERT/DELETE statements require a full primary/partition key
and therefore never use indexes; fix the check_index parameter passed from
modification_statement.

So far the bug is benign as we did not take any action on the value.

Make the parameter non-default to avoid such confusion in the future.
2024-09-15 14:33:56 +03:00
Avi Kivity
71ea3200ba cql3: statement_restrictions: correct mismatched clustering/partition restrictions references
The second loop of calculate_column_defs_for_filtering() finds clustering
keys that are used for filtering, minus and clustering keys that happen
to be used for secondary indexing.

However, to check whether the clustering key is used for secondary indexing,
it looks up in _single_column_partition_key_restrictions, which contains
partition key restrictions.

The end result is that we select a column which ends the partition key
for the secondary index, and so is unnecessary. We do a little more work,
but the bug is benign. Nevertheless, fix it, as it interferes with following
work.
2024-09-15 14:33:56 +03:00
Avi Kivity
33db14e7d5 cql3: statement_restrictions: precalculate get_column_defs_for_filtering()
get_column_defs_for_filtering() names all the columns that are required
for filtering. While doing that, it skips over columns that are participate
in indexing (primary or secondary), since the index "pre-filters" the
query.

We wish to make use of this skipping. As a first step, call the calculation
from the constructor, so we have control over when it is executed.
2024-09-15 14:33:56 +03:00
Avi Kivity
251ad4fcd0 cql3: selection: do_filter(): push static/regular row glue to higher level
Currently, for each column we call get_non_pk_values() to transform
the way we get the information (query::result_row_view) to the way
the expression evaluation machinery wants it (vector<managed_bytes_opt>).

Call it just once outside the loop.
2024-09-15 14:33:56 +03:00
Pavel Emelyanov
f850681b14 table: Use storage options to clean the storage
Like it was done for table::init_storage(), patch the
table::destroy_storage() not to mess with datadir path and rely on
storage options only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
3aea7bebb7 sstables/storage: Re-use ocally generated vector of paths
A cleanup after prefious patch -- in order to create storage options for
table the local initialization code can re-use the vector of paths that
it hag generated in the same call to create table directory layout.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
7c34724509 sstables/storage: Visit options once to initialize storage
The init_table_storage() method now does it twice -- one time to
initialize the storage, another one to create new options for table.
Both can be merged, thus making table storage options initialization
better encapsulated for local/s3 cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
311fb906be sstables_manager: Return table storage options when initalizing storage
Now the table::init_storage() calls sstables manager two times -- first,
to get storage options, second, to initialize the storage with obtained
options. Merge two calls into one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
918ec00c1d sstables/storage: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
f1e4367439 table: Move datadirs initialization parallelism to storage level
The table::init_table_storage() calls sstables_manager's storage
initialization for each of the datadirs found on config. That's not
great, it's sstables manager (and its storage) that know if table needs
to mess with datadirs or not. This patch moves the loop to storage.cc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
b6b3a477c5 sstables/storage: Split the visitor's overloaded functor
The main goal is to have init_table_storage() overload for local options
as standalone function. This makes next patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
30c8d89f97 restore: Don't use table_dir to construct sstable_directory
Continuation of the previous patch patching the special-purpose sstable
directory constructor that's used by restore-from-s3-backup code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
af14408052 sstable_directory: Remove table_dir field
It's no longer needed -- both, lister and making sstable, work with
having storage options at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
f403728aa4 sstable_directory: Use options details in lister
This class is very similar to sstables::storage one -- it also needs
path or s3 prefix to construct. Now when this information is stored on
storage_options, it's better to stick to it, not to the argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
36863d4ad0 sstables_manager: Remove table_dir from make_sstable()
It used to be passed to sstable constructor, but now it doesn't need
this argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
0764eca553 sstables: Remove table_dir from sstable constructor
It used to be passed to storage constructor, now storage works with
options only and this argument is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
d79ae1f02b sstables/storage: Remove sstring dir from make_storage()
Now the directory/s3 prefix is propagated via storage options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
65a19df8ef sstables/storage: Use options to construct
All callers of make_sstable are now patched to provide correct storage
options with path/prefix set. The make_storage() helper can switch to
using it. Respectively, it's good to make sure that the storage is
created with table options that have path/prefix.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
4425cf54c6 tests: Properly initialize storage options with "dir"
Most of the tests work with local storage options. Some support S3
options as well. Whatever it is, when creating an sstable, tests need to
put proper "dir" on the options, this patch does so.

In fact, storage options for tests are created together with the
test-env, and ideally this is the place where dir should be assigned on
it. However, there are still places that explicitly specify path they
want to see sstables at, for those the new temporary options should be
constructed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:49:50 +03:00
Pavel Emelyanov
33bc9e7112 distributed_loader: Create S3 options with prefix for restore
Restore-from-backup code wants to collect sstables from remote S3. For
that it constructs S3 options, and now it needs to put prefix on it as
well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:32:39 +03:00
Pavel Emelyanov
56111a50cd storage_options: Add special-purpose local options maker
Lost of code (in tools and tests) explicitly deal with local sstables
and need to create options for it. Currently default-constructing
options generates local ones, but without the directory path. Add a
helper that creates local options with path and patch callers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:32:39 +03:00
Pavel Emelyanov
95e60cde9f storage_options: Keep local path / s3 prefix onboard
Now when tables keep their own copy of storage options, it's possible
for each table to add table-specific information on it. Namely -- path
for local storage and prefix for S3 one (in fact, it's not a "prefix",
but a key in sstables registry, but fixing it is beyond the scope of
this set).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:32:32 +03:00
Pavel Emelyanov
14976fda73 table: Get another options when initializing storage
Right now the table's storage_options life starts in cql, and shortly
after the lw-shared-pointer to options is put on keyspace metadata.
Later, when the table is created the pointer from keyspace is copied on
the table via its contructor.

Next patches will extend the options pointed to by a table, and the
extension is going to be different for different tables. For that, each
table needs to have its private options and this patch prepares for
that.

For now table directly calls sstables/storage code to get the options
from, but it's temporary, soon the options will be created via sstables
manager together with initialising the storage itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-13 16:32:32 +03:00
Botond Dénes
cb30271d29 readers/flat_mutation_reader_v2: call set_close_required() from consume*()
The `consume*()` variants just forward the call to the `_impl` method
with the same name. The latter, being a member of `::impl`, will bypass
the top level `fill_buffer()`, etc. methods and thus will never call
`set_close_required()`. Do this in the top-level `consume*()` methods
instead, to ensure a reader, on which only `consume*()` is called, and
then is destroyed, will complain as it should (and abort).

operator()() was also missing `set_close_required()`, fix that too.
2024-09-13 06:52:26 -04:00
Botond Dénes
fbed280cd5 test/boost/sstable_compaction_test: close reader after use 2024-09-13 06:52:26 -04:00
Botond Dénes
116b044fec test/boost/repair_test: close reader after use 2024-09-13 06:52:26 -04:00
Botond Dénes
1a11f9cf95 mutation/mutation: split_mutation(): close reader after use 2024-09-13 06:52:26 -04:00
Takuya ASADA
0ac450de05 scylla_raid_setup: configure SELinux file context
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #20573
2024-09-13 04:31:52 +09:00
Takuya ASADA
56c971373c scylla_coredump_setup: fix SELinux configuration for RHEL9
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.

Fixes #19325
2024-09-13 04:31:16 +09:00
Botond Dénes
f834ad81e0 docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps
The part of the document which explains diagnostics dumps was due for an
update. It was missing an explanation on the dumped stats and it
also needs to explain the "Problematic permit" and "Identified
bottleneck(s)".
2024-09-12 08:31:25 -04:00
Botond Dénes
fdff4beb1f test/boost/reader_concurrency_semaphore_test: test the new diagnostics functionality
Adjust the test reader_concurrency_semaphore_dump_reader_diganostics to
also cover the new diagnostics functionality. The test is not a
correctness test -- the output has to be inspected by a human. But it is
good enough to make sure the code paths do not have any memory errors.
2024-09-12 08:31:25 -04:00
Botond Dénes
40b6616d3d reader_concurrency_semaphore: add bottleneck self-diagnosis to diagnosis dump
There are a few typical cases of bottlenecks, which can be easily
identified when dumping the semaphore diagnostics. Identify and print
these to fast-track investigations.
2024-09-12 08:31:25 -04:00
Botond Dénes
7d2b931619 reader_concurrency_semaphore: include trigger permit in diagnostic dump
In the previous patch, we provided an opportunity for callers to provide
a trigger permit, when calling `maybe_dump_reader_permit_diagnostics()`.
If the caller provided the trigger permit, include its details in the
dump, allowing the identification of the table and code-path of the
permit which triggered the dump.
2024-09-12 08:30:50 -04:00
Benny Halevy
0b93409b44 cql_server: connection: process: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-12 11:32:17 +02:00
Benny Halevy
71052dca6a cql_server: connection: process_on_shard: drop permit parameter
It is currently unused in `process_on_shard`, which
generates an empty service_permit.

The next patch may call process_on_shard in a loop,
so it can't simply move the permit to the callee
and better hold on to it until processing completes.

`cql_server::connection::process` was turned into
a coroutine in this patch to hold on to the permit parameter
in a simple way. This is a preliminary step to changing
`if (bounce_msg)` to `while (bounce_msg)` that will allow
rebouncing the message in case it moved yet again when
yielding in `process_on_shard`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-12 11:32:17 +02:00
Benny Halevy
eb7fbdbed2 transport: server: pass bounce_to_shard as foreign shared ptr
So it can safely passed between shards, as will be needed
in the following patch that handles a (re)bounce_to_shard result
from process_fn that's called by `process_on_shard` on the
`move_to_shard`.

With that in mind, pass the `bounce_to_shard` payload
to `process_on_shard` rather than the foreign shared ptr
since the latter grabs what it needs from it on entry
and the shared_ptr can be released on the calling shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-12 11:32:15 +02:00
Benny Halevy
0df6f55379 cql_server: connection: process: add template concept for process_fn
Quoting Avi Kivity:
> Out of scope: we should consider detemplating this.

As a follow-up we should consider that and pass
a function object as process_fn, just make sure
there are no drawbacks.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-12 10:26:13 +02:00
Benny Halevy
150dce5de0 cql_server: move process_fn_return_type to class definition
So it can be used for a template concept in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-12 10:26:13 +02:00
Botond Dénes
c044904f07 reader_concurrency_semaphore: propagate permit to do_dump_reader_permit_diagnostics()
Will be used in the next patch.
2024-09-12 00:51:56 -04:00
Botond Dénes
67565a5eee reader_concurrency_semaphore: use consistent exception type for timeout
When a read times out, we use different exception types for the permit's
future (if the permit is waiting), or the permit's abort exception _ex
(which is used to abort ongoing reads). This patch changes both to use
named_semaphore_timed_out, which is the more verbose of the two.
2024-09-12 00:51:03 -04:00
Botond Dénes
036d27dc1b reader_concurrency_semaphore: dump diagnostics when non-waiting reader times out
Currently the semaphore only dumps diagnostics when a waiting reader
times out. The diagnostics are also useful when a non-waiting reader
(which is in the process of reading) times out, so also dump diagnostics
in this case.
Change the code to use a switch statement, so future addition of states
don't miss updating this logic.
2024-09-12 00:51:03 -04:00
Amnon Heiman
46792bd04f docs/alternator/compatibility.md: explain the consumed capacity provisioned
This patch change the alternator documentation to express that the
provisoned units are stored and return but Alternator ignores them.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-10 17:28:31 -04:00
Amnon Heiman
3726c20564 Add test/alternator/test_provisioned_throughput.py
The test_provisioned_throughput.py test ProvisionedThroughput support.
The first test, check that ProvisionedThroughput can be set and get when
using describe table.

The second test check that missing read or write will throw an
exception.

The third test check that when using billing PAY_PER_REQUEST it returns
zero for the read and write units.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-10 17:27:19 -04:00
Amnon Heiman
9b5f29b6bc test/alternator/util.py: Allow override BillingMode
This patch adds the ability to override the BillingMode. If a
BillingMode is provided to the create_test_table function, it will
override the default BillingMode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-10 17:06:47 -04:00
Amnon Heiman
c76347032d alternator/executor.cc: Store ProvisionedThroughput
This patch adds the ability to store and retrieve the
ProvisionedThroughput in a table.

The information is stored in the table tags. We use the TTL convention
used in alternator, and the tags will be: system:provisioned_rcu and
system:provisioned_wcu.

verify_billing_mode function now return a struct with the billing mode
information.

The code of describe_table now check if the provision tags exists and
return the RCU and WCU accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-09-10 17:06:40 -04:00
Pavel Emelyanov
103c68b419 sstables: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-06 18:24:56 +03:00
Pavel Emelyanov
c47c0f1cd6 sstables: Coroutinize remove_unshared_sstables()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-06 18:24:40 +03:00
Amnon Heiman
8bf8feb5ff service/storage_proxy.cc All metric groups should have the same description 2024-07-31 10:42:35 +03:00
Amnon Heiman
23b62540dd raft/server.cc: All metric groups should have the same description 2024-07-31 10:20:39 +03:00
2786 changed files with 102359 additions and 34287 deletions

View File

@@ -1,7 +1,7 @@
---
Language: Cpp
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignAfterOpenBracket: DontAlign
AlignArrayOfStructures: None
AlignConsecutiveAssignments:
Enabled: false
@@ -42,9 +42,9 @@ AllowAllParametersOfDeclarationOnNextLine: true
AllowShortBlocksOnASingleLine: Never
AllowShortCaseLabelsOnASingleLine: false
AllowShortEnumsOnASingleLine: true
AllowShortFunctionsOnASingleLine: InlineOnly
AllowShortFunctionsOnASingleLine: None
AllowShortIfStatementsOnASingleLine: Never
AllowShortLambdasOnASingleLine: All
AllowShortLambdasOnASingleLine: Empty
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
@@ -52,8 +52,8 @@ AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: Yes
AttributeMacros:
- __capability
BinPackArguments: false
BinPackParameters: false
BinPackArguments: true
BinPackParameters: true
BitFieldColonSpacing: Both
BraceWrapping:
AfterCaseLabel: false
@@ -89,7 +89,7 @@ ColumnLimit: 160
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
ContinuationIndentWidth: 8
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
@@ -103,22 +103,6 @@ ForEachMacros:
- BOOST_FOREACH
IfMacros:
- KJ_IF_MAYBE
IncludeBlocks: Preserve
IncludeCategories:
- Regex: '^"(llvm|llvm-c|clang|clang-c)/'
Priority: 2
SortPriority: 0
CaseSensitive: false
- Regex: '^(<|"(gtest|gmock|isl|json)/)'
Priority: 3
SortPriority: 0
CaseSensitive: false
- Regex: '.*'
Priority: 1
SortPriority: 0
CaseSensitive: false
IncludeIsMainRegex: '(Test)?$'
IncludeIsMainSourceRegex: ''
IndentAccessModifiers: false
IndentCaseBlocks: false
IndentCaseLabels: false
@@ -148,7 +132,7 @@ MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 2
NamespaceIndentation: None
PackConstructorInitializers: NextLine
PackConstructorInitializers: BinPack
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 19
PenaltyBreakComment: 300
@@ -171,9 +155,9 @@ RequiresClausePosition: OwnLine
RequiresExpressionIndentation: OuterScope
SeparateDefinitionBlocks: Leave
ShortNamespaceLines: 1
SortIncludes: CaseSensitive
SortIncludes: Never
SortJavaStaticImport: Before
SortUsingDeclarations: LexicographicNumeric
SortUsingDeclarations: Never
SpaceAfterCStyleCast: false
SpaceAfterLogicalNot: false
SpaceAfterTemplateKeyword: true

1
.gitattributes vendored
View File

@@ -2,3 +2,4 @@
*.hh diff=cpp
*.svg binary
docs/_static/api/js/* binary
pgo/profiles/** filter=lfs diff=lfs merge=lfs -text

5
.github/CODEOWNERS vendored
View File

@@ -91,7 +91,7 @@ test/boost/mutation_reader_test.cc @denesb
test/boost/querier_cache_test.cc @denesb
# PYTEST-BASED CQL TESTS
test/cql-pytest/* @nyh
test/cqlpy/* @nyh
# RAFT
raft/* @kbr-scylla @gleb-cloudius @kostja
@@ -99,3 +99,6 @@ test/raft/* @kbr-scylla @gleb-cloudius @kostja
# HEAT-WEIGHTED LOAD BALANCING
db/heat_load_balance.* @nyh @gleb-cloudius
# Tools
tools/* @denesb

9
.github/dependabot.yml vendored Normal file
View File

@@ -0,0 +1,9 @@
version: 2
updates:
- package-ecosystem: "pip"
directory: "/docs"
schedule:
interval: "daily"
allow:
- dependency-name: "sphinx-scylladb-theme"
- dependency-name: "sphinx-multiversion-scylla"

50
.github/mergify.yml vendored
View File

@@ -15,6 +15,31 @@ pull_request_rules:
- closed
actions:
delete_head_branch:
- name: Automate backport pull request 6.2
conditions:
- or:
- closed
- merged
- or:
- base=master
- base=next
- label=backport/6.2 # The PR must have this label to trigger the backport
- label=promoted-to-master
actions:
copy:
title: "[Backport 6.2] {{ title }}"
body: |
{{ body }}
{% for c in commits %}
(cherry picked from commit {{ c.sha }})
{% endfor %}
Refs #{{number}}
branches:
- branch-6.2
assignees:
- "{{ author }}"
- name: Automate backport pull request 6.1
conditions:
- or:
@@ -40,31 +65,6 @@ pull_request_rules:
- branch-6.1
assignees:
- "{{ author }}"
- name: Automate backport pull request 5.4
conditions:
- or:
- closed
- merged
- or:
- base=master
- base=next
- label=backport/5.4 # The PR must have this label to trigger the backport
- label=promoted-to-master
actions:
copy:
title: "[Backport 5.4] {{ title }}"
body: |
{{ body }}
{% for c in commits %}
(cherry picked from commit {{ c.sha }})
{% endfor %}
Refs #{{number}}
branches:
- branch-5.4
assignees:
- "{{ author }}"
- name: Automate backport pull request 6.0
conditions:
- or:

226
.github/scripts/auto-backport.py vendored Executable file
View File

@@ -0,0 +1,226 @@
#!/usr/bin/env python3
import argparse
import os
import re
import sys
import tempfile
import logging
from github import Github, GithubException
from git import Repo, GitCommandError
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
try:
github_token = os.environ["GITHUB_TOKEN"]
except KeyError:
print("Please set the 'GITHUB_TOKEN' environment variable")
sys.exit(1)
def is_pull_request():
return '--pull-request' in sys.argv[1:]
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--repo', type=str, required=True, help='Github repository name')
parser.add_argument('--base-branch', type=str, default='refs/heads/master', help='Base branch')
parser.add_argument('--commits', default=None, type=str, help='Range of promoted commits.')
parser.add_argument('--pull-request', type=int, help='Pull request number to be backported')
parser.add_argument('--head-commit', type=str, required=is_pull_request(), help='The HEAD of target branch after the pull request specified by --pull-request is merged')
return parser.parse_args()
def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft=False):
pr_body = f'{pr.body}\n\n'
for commit in commits:
pr_body += f'- (cherry picked from commit {commit})\n\n'
pr_body += f'Parent PR: #{pr.number}'
try:
backport_pr = repo.create_pull(
title=backport_pr_title,
body=pr_body,
head=f'scylladbbot:{new_branch_name}',
base=base_branch_name,
draft=is_draft
)
logging.info(f"Pull request created: {backport_pr.html_url}")
backport_pr.add_to_assignees(pr.user)
if is_draft:
backport_pr.add_to_labels("conflicts")
pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
backport_pr.create_issue_comment(pr_comment)
logging.info(f"Assigned PR to original author: {pr.user}")
return backport_pr
except GithubException as e:
if 'A pull request already exists' in str(e):
logging.warning(f'A pull request already exists for {pr.user}:{new_branch_name}')
else:
logging.error(f'Failed to create PR: {e}')
def get_pr_commits(repo, pr, stable_branch, start_commit=None):
commits = []
if pr.merged:
merge_commit = repo.get_commit(pr.merge_commit_sha)
if len(merge_commit.parents) > 1: # Check if this merge commit includes multiple commits
for commit in pr.get_commits():
commits.append(commit.sha)
else:
if start_commit:
promoted_commits = repo.compare(start_commit, stable_branch).commits
else:
promoted_commits = repo.get_commits(sha=stable_branch)
for commit in pr.get_commits():
for promoted_commit in promoted_commits:
commit_title = commit.commit.message.splitlines()[0]
# In Scylla-pkg and scylla-dtest, for example,
# we don't create a merge commit for a PR with multiple commits,
# according to the GitHub API, the last commit will be the merge commit,
# which is not what we need when backporting (we need all the commits).
# So here, we are validating the correct SHA for each commit so we can cherry-pick
if promoted_commit.commit.message.startswith(commit_title):
commits.append(promoted_commit.sha)
elif pr.state == 'closed':
events = pr.get_issue_events()
for event in events:
if event.event == 'closed':
commits.append(event.commit_id)
return commits
def create_pr_comment_and_remove_label(pr, comment_body):
labels = pr.get_labels()
pattern = re.compile(r"backport/\d+\.\d+$")
for label in labels:
if pattern.match(label.name):
print(f"Removing label: {label.name}")
comment_body += f'- {label.name}\n'
pr.remove_from_labels(label)
pr.create_issue_comment(comment_body)
def backport(repo, pr, version, commits, backport_base_branch):
new_branch_name = f'backport/{pr.number}/to-{version}'
backport_pr_title = f'[Backport {version}] {pr.title}'
repo_url = f'https://scylladbbot:{github_token}@github.com/{repo.full_name}.git'
fork_repo = f'https://scylladbbot:{github_token}@github.com/scylladbbot/{repo.name}.git'
with (tempfile.TemporaryDirectory() as local_repo_path):
try:
repo_local = Repo.clone_from(repo_url, local_repo_path, branch=backport_base_branch)
repo_local.git.checkout(b=new_branch_name)
is_draft = False
for commit in commits:
try:
repo_local.git.cherry_pick(commit, '-x')
except GitCommandError as e:
logging.warning(f'Cherry-pick conflict on commit {commit}: {e}')
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
# Check if the branch already exists in the remote fork
remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)
if not remote_refs:
# Branch does not exist, create it with a regular push
repo_local.git.push(fork_repo, new_branch_name)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft)
else:
logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")
def with_github_keyword_prefix(repo, pr):
# GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs
github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92
jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
# Check PR body for GitHub issues
github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)
# Check PR body for JIRA issues
jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)
match = github_match or jira_match
if match:
return True
for commit in pr.get_commits():
github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)
jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)
if github_match or jira_match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
return True
print(f'No valid close reference for {pr.number}')
return False
def main():
args = parse_args()
base_branch = args.base_branch.split('/')[2]
promoted_label = 'promoted-to-master'
repo_name = args.repo
fork_repo_name = 'scylladbbot/scylladb'
if 'scylla-enterprise' in args.repo:
promoted_label = 'promoted-to-enterprise'
fork_repo_name = 'scylladbbot/scylla-enterprise'
stable_branch = base_branch
backport_branch = 'branch-'
backport_label_pattern = re.compile(r'backport/\d+\.\d+$')
g = Github(github_token)
repo = g.get_repo(repo_name)
scylladbbot_repo = g.get_repo(fork_repo_name)
closed_prs = []
start_commit = None
if args.commits:
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
for commit in commits:
match = re.search(rf"Closes .*#([0-9]+)", commit.commit.message, re.IGNORECASE)
if match:
pr_number = int(match.group(1))
pr = repo.get_pull(pr_number)
closed_prs.append(pr)
if args.pull_request:
start_commit = args.head_commit
pr = repo.get_pull(args.pull_request)
closed_prs = [pr]
for pr in closed_prs:
labels = [label.name for label in pr.labels]
backport_labels = [label for label in labels if backport_label_pattern.match(label)]
if promoted_label not in labels:
print(f'no {promoted_label} label: {pr.number}')
continue
if not backport_labels:
print(f'no backport label: {pr.number}')
continue
if args.commits and not with_github_keyword_prefix(repo, pr):
continue
if not repo.private and not scylladbbot_repo.has_in_collaborators(pr.user.login):
logging.info(f"Sending an invite to {pr.user.login} to become a collaborator to {scylladbbot_repo.full_name} ")
scylladbbot_repo.add_to_collaborators(pr.user.login)
comment = f':warning: @{pr.user.login} you have been added as collaborator to scylladbbot fork '
comment += f'Please check your inbox and approve the invitation, once it is done, please add the backport labels again\n'
create_pr_comment_and_remove_label(pr, comment)
continue
commits = get_pr_commits(repo, pr, stable_branch, start_commit)
logging.info(f"Found PR #{pr.number} with commit {commits} and the following labels: {backport_labels}")
for backport_label in backport_labels:
version = backport_label.replace('backport/', '')
backport_base_branch = backport_label.replace('backport/', backport_branch)
backport(repo, pr, version, commits, backport_base_branch)
if __name__ == "__main__":
main()

View File

@@ -16,13 +16,8 @@ def parser():
parser = argparse.ArgumentParser()
parser.add_argument('--repository', type=str, required=True,
help='Github repository name (e.g., scylladb/scylladb)')
parser.add_argument('--commit_before_merge', type=str, required=True, help='Git commit ID to start labeling from ('
'newest commit).')
parser.add_argument('--commit_after_merge', type=str, required=True,
help='Git commit ID to end labeling at (oldest '
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--commits', type=str, required=True, help='Range of promoted commits.')
parser.add_argument('--label', type=str, default='promoted-to-master', help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
@@ -53,38 +48,41 @@ def main():
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
processed_prs = set()
# Print commit information
for commit in commits.commits:
for commit in commits:
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
pr_number = int(match.group(1))
if pr_number in processed_prs:
continue
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merge the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit,
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'
add_comment_and_close_pr(pr, comment)
else:
try:
pr_last_line = commit.commit.message.splitlines()
for line in reversed(pr_last_line):
match = pr_pattern.search(line)
if match:
pr_number = int(match.group(1))
if pr_number in processed_prs:
continue
if target_branch:
pr = repo.get_pull(pr_number)
pr.add_to_labels('promoted-to-master')
print(f'master branch, pr number is: {pr_number}')
except UnknownObjectException:
print(f'{pr_number} is not a PR but an issue, no need to add label')
processed_prs.add(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Parent PR: (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merged the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit.
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'
add_comment_and_close_pr(pr, comment)
else:
try:
pr = repo.get_pull(pr_number)
pr.add_to_labels('promoted-to-master')
print(f'master branch, pr number is: {pr_number}')
except UnknownObjectException:
print(f'{pr_number} is not a PR but an issue, no need to add label')
processed_prs.add(pr_number)
if __name__ == "__main__":

View File

@@ -5,9 +5,10 @@ on:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
- enterprise
pull_request_target:
types: [labeled]
branches: [master, next, enterprise]
jobs:
check-commit:
@@ -20,17 +21,51 @@ jobs:
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Set Default Branch
id: set_branch
run: |
if [[ "${{ github.repository }}" == *enterprise* ]]; then
echo "DEFAULT_BRANCH=enterprise" >> $GITHUB_ENV
else
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Set up Git identity
run: |
git config --global user.name "GitHub Action"
git config --global user.email "action@github.com"
git config --global merge.conflictstyle diff3
- name: Install dependencies
run: sudo apt-get install -y python3-github
run: sudo apt-get install -y python3-github python3-git
- name: Run python script
if: github.event_name == 'push'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commits ${{ github.event.before }}..${{ github.sha }} --repository ${{ github.repository }} --ref ${{ github.ref }}
- name: Run auto-backport.py when promotion completed
if: ${{ github.event_name == 'push' && github.ref == format('refs/heads/{0}', env.DEFAULT_BRANCH) }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}
- name: Check if label starts with 'backport/' and contains digits
id: check_label
run: |
label_name="${{ github.event.label.name }}"
if [[ "$label_name" =~ ^backport/[0-9]+\.[0-9]+$ ]]; then
echo "Label matches backport/X.X pattern."
echo "backport_label=true" >> $GITHUB_OUTPUT
else
echo "Label does not match the required pattern."
echo "backport_label=false" >> $GITHUB_OUTPUT
fi
- name: Run auto-backport.py when label was added
if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -0,0 +1,53 @@
name: Backport with Jira Integration
on:
push:
branches:
- master
- next-*.*
- branch-*.*
pull_request_target:
types: [labeled, closed]
branches:
- master
- next
- next-*.*
- branch-*.*
jobs:
backport-on-push:
if: github.event_name == 'push'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'push'
base_branch: ${{ github.ref }}
commits: ${{ github.event.before }}..${{ github.sha }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-on-label:
if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'labeled'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
head_commit: ${{ github.event.pull_request.base.sha }}
label_name: ${{ github.event.label.name }}
pr_state: ${{ github.event.pull_request.state }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-chain:
if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'chain'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
pr_body: ${{ github.event.pull_request.body }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -10,6 +10,9 @@ on:
- 'docs/**'
- '.github/**'
workflow_dispatch:
issue_comment:
types:
- created
env:
BUILD_TYPE: RelWithDebInfo
@@ -25,6 +28,7 @@ concurrency:
jobs:
read-toolchain:
if: github.event_name == 'pull_request' || (github.event.issue.pull_request && startsWith(github.event.comment.body, '/clang-tidy'))
uses: ./.github/workflows/read-toolchain.yaml
clang-tidy:
name: Run clang-tidy

View File

@@ -0,0 +1,45 @@
name: Notify PR Authors of Conflicts
on:
schedule:
- cron: '0 10 * * 1,4' # Runs every Monday and Thursday at 10:00am
workflow_dispatch: # Manual trigger for testing
jobs:
notify_conflict_prs:
runs-on: ubuntu-latest
steps:
- name: Notify PR Authors of Conflicts
uses: actions/github-script@v7
with:
script: |
const prs = await github.paginate(github.rest.pulls.list, {
owner: context.repo.owner,
repo: context.repo.repo,
state: 'open',
per_page: 100
});
const branchPrefix = 'branch-';
const threeDaysAgo = new Date();
const conflictLabel = 'conflicts';
threeDaysAgo.setDate(threeDaysAgo.getDate() - 3);
for (const pr of prs) {
if (!pr.base.ref.startsWith(branchPrefix)) continue;
const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);
if (!hasConflictLabel) continue;
const updatedDate = new Date(pr.updated_at);
if (updatedDate >= threeDaysAgo) continue;
if (pr.assignee === null) continue;
const assignee = pr.assignee.login;
if (assignee) {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
body: `@${assignee}, this PR has been open with conflicts. Please resolve the conflicts so we can merge it.`,
});
console.log(`Notified @${assignee} for PR #${pr.number}`);
}
}
console.log(`Total PRs checked: ${prs.length}`);

View File

@@ -0,0 +1,32 @@
---
# https://github.com/redhat-plumbers-in-action/differential-shellcheck#readme
name: Differential ShellCheck
on:
push:
branches:
- master
pull_request:
branches:
- master
permissions:
contents: read
jobs:
lint:
runs-on: ubuntu-latest
permissions:
security-events: write
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Differential ShellCheck
uses: redhat-plumbers-in-action/differential-shellcheck@v5
with:
severity: warning
token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -9,7 +9,9 @@ env:
BUILD_TYPE: RelWithDebInfo
BUILD_DIR: build
CLEANER_OUTPUT_PATH: build/clang-include-cleaner.log
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction
# the "idl" subdirectory does not contain C++ source code. the .hh files in it are
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops redis replica
permissions: {}
@@ -43,6 +45,10 @@ jobs:
-G Ninja \
-B $BUILD_DIR \
-S .
- run: |
cmake \
--build $BUILD_DIR \
--target wasmtime_bindings
- name: Build headers
run: |
swagger_targets=''

View File

@@ -0,0 +1,27 @@
name: Mark PR as Ready When Conflicts Label is Removed
on:
pull_request_target:
types:
- unlabeled
env:
DEFAULT_BRANCH: 'master'
jobs:
mark-ready:
if: github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 1
- name: Mark pull request as ready for review
run: gh pr ready "${{ github.event.pull_request.number }}"
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

View File

@@ -15,10 +15,13 @@ env:
BUILD_DIR: build
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
build-with-the-latest-seastar:
needs:
- read-toolchain
runs-on: ubuntu-latest
# be consistent with tools/toolchain/image
container: scylladb/scylla-toolchain:fedora-40-20240621
container: ${{ needs.read-toolchain.outputs.image }}
strategy:
matrix:
build_type:

View File

@@ -0,0 +1,58 @@
name: Urgent Issue Reminder
on:
schedule:
- cron: '10 8 * * 1' # Runs every Monday at 8 AM
jobs:
reminder:
runs-on: ubuntu-latest
steps:
- name: Send reminders
uses: actions/github-script@v7
with:
script: |
const labelFilters = ['P0', 'P1', 'Field-Tier1','status/release blocker', 'status/regression'];
const excludingLabelFilters = ['documentation'];
const daysInactive = 7;
const now = new Date();
// Fetch open issues
const issues = await github.rest.issues.listForRepo({
owner: context.repo.owner,
repo: context.repo.repo,
state: 'open'
});
console.log("Looking for issues with labels:"+labelFilters+", excluding labels:"+excludingLabelFilters+ ", inactive for more than "+daysInactive+" days.");
for (const issue of issues.data) {
// Check if issue has any of the specified labels
const hasFilteredLabel = issue.labels.some(label => labelFilters.includes(label.name));
const hasExcludingLabel = issue.labels.some(label => excludingLabelFilters.includes(label.name));
if (hasExcludingLabel) continue;
if (!hasFilteredLabel) continue;
// Check for inactivity
const lastUpdated = new Date(issue.updated_at);
const diffInDays = (now - lastUpdated) / (1000 * 60 * 60 * 24);
console.log("Issue #"+issue.number+"; Days inactive:"+diffInDays);
if (diffInDays > daysInactive) {
if (issue.assignees.length > 0) {
console.log("==>> Alert about issue #"+issue.number);
const assigneesLogins = issue.assignees.map(assignee => `@${assignee.login}`).join(', ');
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issue.number,
body: `${assigneesLogins}, This urgent issue had no activity for more than ${daysInactive} days. Please check its status.\n CC @mykaul @dani-tweig`
});
} else {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issue.number,
body: `This urgent issue had no activity for more than ${daysInactive} days. Please check its status.\n CC @mykaul @dani-tweig`
});
}
}
}

1
.gitignore vendored
View File

@@ -34,3 +34,4 @@ compile_commands.json
.mypy_cache
.envrc
clang_build
.idea/

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -22,8 +22,11 @@ if(DEFINED CMAKE_BUILD_TYPE)
endif()
endif(DEFINED CMAKE_BUILD_TYPE)
option(Scylla_ENABLE_LTO "Turn on link-time optimization for the 'release' mode." ON)
include(mode.common)
if(CMAKE_CONFIGURATION_TYPES)
get_property(is_multi_config GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG)
if(is_multi_config)
foreach(config ${CMAKE_CONFIGURATION_TYPES})
include(mode.${config})
list(APPEND scylla_build_modes ${scylla_build_mode_${config}})
@@ -41,31 +44,78 @@ else()
endif()
include(limit_jobs)
# Configure Seastar compile options to align with Scylla
set(CMAKE_CXX_STANDARD "23" CACHE INTERNAL "")
set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")
set(CMAKE_CXX_VISIBILITY_PRESET hidden)
set(Seastar_TESTING ON CACHE BOOL "" FORCE)
set(Seastar_API_LEVEL 7 CACHE STRING "" FORCE)
set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)
set(Seastar_APPS ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_IO_URING OFF CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 16 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
if(is_multi_config)
find_package(Seastar)
# this is atypical compared to standard ExternalProject usage:
# - Seastar's build system should already be configured at this point.
# - We maintain separate project variants for each configuration type.
#
# Benefits of this approach:
# - Allows the parent project to consume the compile options exposed by
# .pc file. as the compile options vary from one config to another.
# - Allows application of config-specific settings
# - Enables building Seastar within the parent project's build system
# - Facilitates linking of artifacts with the external project target,
# establishing proper dependencies between them
include(ExternalProject)
ExternalProject_Add(Seastar
SOURCE_DIR "${PROJECT_SOURCE_DIR}/seastar"
BINARY_DIR "${CMAKE_BINARY_DIR}/$<CONFIG>/seastar"
CONFIGURE_COMMAND ""
BUILD_COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR>
--target seastar
--target seastar_testing
--target seastar_perf_testing
--target app_iotune
BUILD_ALWAYS ON
BUILD_BYPRODUCTS
<BINARY_DIR>/libseastar.$<IF:$<CONFIG:Debug,Dev>,so,a>
<BINARY_DIR>/libseastar_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>
<BINARY_DIR>/libseastar_perf_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>
<BINARY_DIR>/apps/iotune/iotune
<BINARY_DIR>/gen/include/seastar/http/chunk_parsers.hh
<BINARY_DIR>/gen/include/seastar/http/request_parser.hh
<BINARY_DIR>/gen/include/seastar/http/response_parser.hh
INSTALL_COMMAND "")
add_dependencies(Seastar::seastar Seastar)
add_dependencies(Seastar::seastar_testing Seastar)
else()
set(Seastar_TESTING ON CACHE BOOL "" FORCE)
set(Seastar_API_LEVEL 7 CACHE STRING "" FORCE)
set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)
set(Seastar_APPS ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_IO_URING ON CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 19 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
target_compile_definitions (seastar
PRIVATE
SEASTAR_NO_EXCEPTION_HACK)
endif()
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
if(Scylla_ENABLE_LTO)
list(APPEND absl_cxx_flags $<$<CONFIG:RelWithDebInfo>:${CMAKE_CXX_COMPILE_OPTIONS_IPO};-ffat-lto-objects>)
endif()
find_package(Sanitizers QUIET)
set(sanitizer_cxx_flags
list(APPEND absl_cxx_flags
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set(ABSL_GCC_FLAGS ${sanitizer_cxx_flags})
list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
set(ABSL_LLVM_FLAGS ${sanitizer_cxx_flags})
list(APPEND ABSL_LLVM_FLAGS ${absl_cxx_flags})
endif()
set(ABSL_DEFAULT_LINKOPTS
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_LINK_LIBRARIES>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_LINK_LIBRARIES>>)
@@ -98,10 +148,13 @@ find_package(ICU COMPONENTS uc i18n REQUIRED)
find_package(fmt 10.0.0 REQUIRED)
find_package(libdeflate REQUIRED)
find_package(libxcrypt REQUIRED)
find_package(p11-kit REQUIRED)
find_package(Snappy REQUIRED)
find_package(RapidJSON REQUIRED)
find_package(xxHash REQUIRED)
find_package(yaml-cpp REQUIRED)
find_package(zstd REQUIRED)
find_package(lz4 REQUIRED)
set(scylla_gen_build_dir "${CMAKE_BINARY_DIR}/gen")
file(MAKE_DIRECTORY "${scylla_gen_build_dir}")
@@ -147,6 +200,7 @@ target_sources(scylla-main
tombstone_gc_options.cc
tombstone_gc.cc
reader_concurrency_semaphore.cc
reader_concurrency_semaphore_group.cc
row_cache.cc
schema_mutations.cc
serializer.cc
@@ -169,7 +223,10 @@ target_link_libraries(scylla-main
Seastar::seastar
Snappy::snappy
systemd
ZLIB::ZLIB)
ZLIB::ZLIB
lz4::lz4_static
zstd::zstd_static
)
option(Scylla_CHECK_HEADERS
"Add check-headers target for checking the self-containness of headers")
@@ -196,10 +253,15 @@ include(check_headers)
check_headers(check-headers scylla-main
GLOB ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)
option(Scylla_DIST
"Build dist targets"
ON)
add_custom_target(compiler-training)
add_subdirectory(api)
add_subdirectory(alternator)
add_subdirectory(audit)
add_subdirectory(db)
add_subdirectory(auth)
add_subdirectory(cdc)
@@ -207,6 +269,7 @@ add_subdirectory(compaction)
add_subdirectory(cql3)
add_subdirectory(data_dictionary)
add_subdirectory(dht)
add_subdirectory(ent)
add_subdirectory(gms)
add_subdirectory(idl)
add_subdirectory(index)
@@ -237,7 +300,8 @@ add_version_library(scylla_version
add_executable(scylla
main.cc)
target_link_libraries(scylla PRIVATE
set(scylla_libs
audit
scylla-main
api
auth
@@ -248,10 +312,12 @@ target_link_libraries(scylla PRIVATE
cql3
data_dictionary
dht
encryption
gms
idl
index
lang
ldap
locator
message
mutation
@@ -272,10 +338,21 @@ target_link_libraries(scylla PRIVATE
transport
types
utils)
target_link_libraries(scylla PRIVATE
${scylla_libs})
if(Scylla_ENABLE_LTO)
include(enable_lto)
foreach(target scylla ${scylla_libs})
enable_lto(${target})
endforeach()
endif()
target_link_libraries(scylla PRIVATE
seastar
p11-kit::p11-kit
Seastar::seastar
absl::headers
yaml-cpp::yaml-cpp
Boost::program_options)
target_include_directories(scylla PRIVATE
@@ -287,4 +364,10 @@ add_custom_target(maybe-scylla
add_dependencies(compiler-training
maybe-scylla)
add_subdirectory(dist)
if(Scylla_DIST)
add_subdirectory(dist)
endif()
if(Scylla_BUILD_INSTRUMENTED)
add_subdirectory(pgo)
endif()

View File

@@ -0,0 +1,62 @@
## **SCYLLADB SOFTWARE LICENSE AGREEMENT**
| Version: | 1.0 |
| :---- | :---- |
| Last updated: | December 18, 2024 |
**Your Acceptance**
By utilizing or accessing the Software in any manner, You hereby confirm and agree to be bound by this ScyllaDB Software License Agreement (the "**Agreement**"), which sets forth the terms and conditions on which ScyllaDB Ltd. ("**Licensor**") makes the Software available to You, as the Licensee. If Licensee does not agree to the terms of this Agreement or cannot otherwise comply with the Agreement, Licensee shall not utilize or access the Software.
The terms "**You**" or "**Licensee**" refer to any individual accessing or using the Software under this Agreement ("**Use**"). In case that such individual is Using the Software on behalf of a legal entity, You hereby irrevocably represents and warrants that You have full legal capacity and authority to enter into this Agreement on behalf of such entity as well as bind such entity to this Agreement, and in such case, the term "You" or "Licensee" in this Agreement will refer to such entity.
**Grant of License**
* **Software Definitions:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
* **Grant of License:** Subject to the terms and conditions of this Agreement, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
1) Copying, distributing, evaluating (including performing benchmarking or comparative tests or evaluations , subject to the limitations below) and improving the Software and ScyllaDB; and
2) create a modified version of the Software (each, a "**Licensed Work**"); provided however, that each such Licensed Work keeps all or substantially all of the functions and features of the Software, and/or using all or substantially all of the source code of the Software. You hereby agree that all the Licensed Work are, upon creation, considered Licensed Work of the Licensor, shall be the sole property of the Licensor and its assignees, and the Licensor and its assignees shall be the sole owner of all rights of any kind or nature, in connection with such Licensed Work. You hereby irrevocably and unconditionally assign to the Licensor all the Licensed Work and any part thereof. This License applies separately for each version of the Licensed Work, which shall be considered "Software" for the purpose of this Agreement.
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensees Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
* **Updates:** You shall be solely responsible for providing all equipment, systems, assets, access, and ancillary goods and services needed to access and Use the Software. Licensor may modify or update the Software at any time, without notification, in its sole and absolute discretion. After the effective date of each such update, Licensor shall bear no obligation to run, provide or support legacy versions of the Software.
* **"Usage Limit":** Licensee's total overall available storage across all deployments and clusters of the Software and the Licensed Work under this License shall not exceed 10TB and/or an upper limit of 50 VCPUs (hyper threads).
* **IP Markings:** Licensee must retain all copyright, trademark, and other proprietary notices contained in the Software. You will not modify, delete, alter, remove, or obscure any intellectual property, including without limitations licensing, copyright, trademark, or any other notices of Licensor in the Software.
* **License Reproduction:** You must conspicuously display this Agreement on each copy of the Software. If You receive the Software from a third party, this Agreement still applies to Your Use of the Software. You will be responsible for any breach of this Agreement by any such third-party.
* Distribution of any Licensed Works is permitted, provided that: (i) You must include in any Licensed Work prominent notices stating that You have modified the Software, (ii) You include a copy of this Agreement with the Licensed Work, and (iii) You clearly identify all modifications made in the Licensed Work and provides attribution to the Licensor as the original author(s) of the Software.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* Notwithstanding anything to the contrary, under the License granted hereunder, You shall not and shall not permit others to: (i) transfer the Software or any portions thereof to any other party except as expressly permitted herein; (ii) attempt to circumvent or overcome any technological protection measures incorporated into the Software; (iii) incorporate the Software into the structure, machinery or controls of any aircraft, other aerial device, military vehicle, hovercraft, waterborne craft or any medical equipment of any kind; or (iv) use the Software or any part thereof in any unlawful, harmful or illegal manner, or in a manner which infringes third parties rights in any way, including intellectual property rights.
**Monitoring; Audit**
* **License Key:** Licensor may implement a method of authentication, e.g., a unique license token ("License Key") as a condition of accessing or using the Software. Upon the implementation of such License Key, Licensee agrees to comply with Licensor terms and requirements with regards to such License Key
* **Monitoring & Data Sharing:** Licensor do not collect customer data from its database. Notwithstanding, Licensee acknowledges and agrees that the License Key and Software may share telemetry metrics and information regarding the execution volume and statistics with Licensor regarding Licensees use of the same. Any disclosure or use of such information shall be subject to, and in accordance with, Licensors Privacy Policy and Data Processing Agreement, which can be found at [https://www.scylladb.com/policies-agreements](https://www.scylladb.com/policies-agreements).
* **Information Requests; Audits:** Licensee shall keep accurate records of its access to and use of any Software, and shall promptly respond to any Licensor requests for information regarding the same. To ensure compliance with the terms of this Agreement, during the term of this Agreement and for a period of one (1) year thereafter, Licensor (or an agent bound by customary confidentiality undertakings on its behalf) may audit Licensees records which are related to its access to or use of the Software. The cost of such audit shall be borne by Licensor unless it is determined that Licensee has materially breached this Agreement.
**Termination**
* **Termination:** Licensor may immediately terminate this Agreement will automatically terminate if You for any reason, including without limitation for (i) Licensees breach of any term, condition, or restriction of this Agreement, unless such breach was cured to Licensors satisfaction within no more than 15 days from the date of the breach. Notwithstanding the foregoing, intentional; or (ii) if Licensee brings any claim, demand or repeated breaches lawsuit against Licensor.
* **Obligations on Termination:** Upon termination of this Agreement by You will cause Your licenses to terminate automatically and permanently, at Licensors sole discretion, Licensee must (i) immediately stop using any Software, (ii) return all copies of any tools or documentation provided by Licensor; and (iii) pay amount due to Licensor hereunder (e.g., audit costs). All obligations which by their nature must survive the termination of this Agreement shall so survive.
**Indemnity; Disclaimer; Limitation of Liability**
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensees breach of this Agreement; (ii) Licensees negligence, willful misconduct or violation of law, or (iii) Licensees products or services.
* DISCLAIMER OF WARRANTIES: LICENSEE AGREES THAT LICENSOR HAS MADE NO EXPRESS WARRANTIES REGARDING THE SOFTWARE AND THAT THE SOFTWARE IS BEING PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. LICENSOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THE SOFTWARE, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE; TITLE; MERCHANTABILITY; OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS. LICENSOR DOES NOT WARRANT THAT THE SOFTWARE WILL OPERATE UNINTERRUPTED OR ERROR FREE, OR THAT ALL ERRORS WILL BE CORRECTED. LICENSOR DOES NOT GUARANTEE ANY PARTICULAR RESULTS FROM THE USE OF THE SOFTWARE, AND DOES NOT WARRANT THAT THE SOFTWARE IS FIT FOR ANY PARTICULAR PURPOSE.
* LIMITATION OF LIABILITY: TO THE FULLEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW, IN NO EVENT WILL LICENSOR AND/OR ITS AFFILIATES, EMPLOYEES, OFFICERS AND DIRECTORS BE LIABLE TO LICENSEE FOR (I) ANY LOSS OF USE OR DATA; INTERRUPTION OF BUSINESS; OR ANY INDIRECT; SPECIAL; INCIDENTAL; OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING LOST PROFITS); AND (II) ANY DIRECT DAMAGES EXCEEDING THE TOTAL AMOUNT OF ONE THOUSAND US DOLLARS ($1,000). THE FOREGOING PROVISIONS LIMITING THE LIABILITY OF LICENSOR SHALL APPLY REGARDLESS OF THE FORM OR CAUSE OF ACTION, WHETHER IN STRICT LIABILITY, CONTRACT OR TORT.
**Proprietary Rights; No Other Rights**
* **Ownership:** Licensor retains sole and exclusive ownership of all rights, interests and title in the Software and any scripts, processes, techniques, methodologies, inventions, know-how, concepts, formatting, arrangements, visual attributes, ideas, database rights, copyrights, patents, trade secrets, and other intellectual property related thereto, and all derivatives, enhancements, modifications and improvements thereof. Except for the limited license rights granted herein, Licensee has no rights in or to the Software and/ or Licensors trademarks, logo, or branding and You acknowledge that such Software, trademarks, logo, or branding is the sole property of Licensor.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between the You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.
* **Third-Party Software:** Customer acknowledges that the Software may contain open and closed source components (“OSS Components”) that are governed separately by certain licenses, in each case as further provided by Company upon request. Any applicable OSS Component license is solely between Licensee and the applicable licensor of the OSS Component and Licensee shall comply with the applicable OSS Component license.
* If any provision of this Agreement is held to be invalid or unenforceable, such provision shall be struck and the remaining provisions shall remain in full force and effect.
**Miscellaneous**
* **Miscellaneous:** This Agreement may be modified at any time by Licensor, and constitutes the entire agreement between the parties with respect to the subject matter hereof. Licensee may not assign or subcontract its rights or obligations under this Agreement. This Agreement does not, and shall not be construed to create any relationship, partnership, joint venture, employer-employee, agency, or franchisor-franchisee relationship between the parties.
* **Governing Law & Jurisdiction:** This Agreement shall be governed and construed in accordance with the laws of Israel, without giving effect to their respective conflicts of laws provisions, and the competent courts situated in Tel Aviv, Israel, shall have sole and exclusive jurisdiction over the parties and any conflict and/or dispute arising out of, or in connection to, this Agreement
\[*End of ScyllaDB Software License Agreement*\]

View File

@@ -1,661 +0,0 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.
The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.
An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU Affero General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software. This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<http://www.gnu.org/licenses/>.

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=6.2.0-dev
VERSION=2025.1.12
if test -f version
then
@@ -104,7 +104,7 @@ else
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" | rev | cut -d . -f 1 | rev)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "absl-flat_hash_map.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -15,6 +15,7 @@ target_sources(alternator
conditions.cc
auth.cc
streams.cc
consumed_capacity.cc
ttl.cc
${cql_grammar_srcs})
target_include_directories(alternator
@@ -24,11 +25,13 @@ target_include_directories(alternator
PRIVATE
${RAPIDJSON_INCLUDE_DIRS})
target_link_libraries(alternator
cql3
idl
Seastar::seastar
xxHash::xxhash
absl::headers)
PUBLIC
Seastar::seastar
xxHash::xxhash
PRIVATE
cql3
idl
absl::headers)
check_headers(check-headers alternator
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -3,12 +3,12 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "alternator/error.hh"
#include "auth/common.hh"
#include "log.hh"
#include "utils/log.hh"
#include <string>
#include <string_view>
#include "bytes.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <string_view>
@@ -15,8 +15,6 @@
#include "utils/base64.hh"
#include "utils/rjson.hh"
#include <stdexcept>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/algorithm/cxx11/any_of.hpp>
#include "utils/overloaded_functor.hh"
#include "expressions.hh"
@@ -743,9 +741,9 @@ bool verify_condition_expression(
};
switch (list.op) {
case '&':
return boost::algorithm::all_of(list.conditions, verify_condition);
return std::ranges::all_of(list.conditions, verify_condition);
case '|':
return boost::algorithm::any_of(list.conditions, verify_condition);
return std::ranges::any_of(list.conditions, verify_condition);
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error("bad operator in condition_list");

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
/*

View File

@@ -0,0 +1,94 @@
/*
* Copyright 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "consumed_capacity.hh"
#include "error.hh"
namespace alternator {
/*
* \brief DynamoDB counts read capacity in half-integers - a short
* eventually-consistent read is counted as 0.5 unit.
* Because we want our counter to be an integer, it counts half units.
* Both read and write counters count in these half-units, and should be
* multiply by 0.5 (HALF_UNIT_MULTIPLIER) to get the DynamoDB-compatible RCU or WCU numbers.
*/
static constexpr double HALF_UNIT_MULTIPLIER = 0.5;
static constexpr uint64_t KB = 1024ULL;
static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;
static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;
bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {
const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");
if (!return_consumed) {
return false;
}
if (!return_consumed->IsString()) {
throw api_error::validation("Non-string ReturnConsumedCapacity field in request");
}
std::string consumed = return_consumed->GetString();
if (consumed == "INDEXES") {
throw api_error::validation("INDEXES consumed capacity is not supported");
}
if (consumed != "TOTAL") {
throw api_error::validation("Unknown consumed capacity "+ consumed);
}
return true;
}
void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {
if (_should_add_to_reponse) {
auto consumption = rjson::empty_object();
rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());
rjson::add(response, "ConsumedCapacity", std::move(consumption));
}
}
static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_bytes, bool is_quorum) {
uint64_t half_units = (total_bytes + unit_block_size -1) / unit_block_size; //divide by unit_block_size and round up
if (is_quorum) {
half_units *= 2;
}
return half_units;
}
rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :
consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {
}
uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);
}
uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {
return get_half_units(_total_bytes, _is_quorum);
}
uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, _total_bytes, true);
}
uint64_t wcu_consumed_capacity_counter::get_units(uint64_t total_bytes) noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, true) * HALF_UNIT_MULTIPLIER;
}
wcu_consumed_capacity_counter::wcu_consumed_capacity_counter(const rjson::value& request) :
consumed_capacity_counter(should_add_capacity(request)) {
}
consumed_capacity_counter& consumed_capacity_counter::operator +=(uint64_t units) {
_total_bytes += units;
return *this;
}
double consumed_capacity_counter::get_consumed_capacity_units() const noexcept {
return get_half_units() * HALF_UNIT_MULTIPLIER;
}
}

View File

@@ -0,0 +1,66 @@
/*
* Copyright 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "utils/rjson.hh"
namespace alternator {
/**
* \brief consumed_capacity_counter is a base class that holds the bookkeeping
* to calculate RCU and WCU
*
* DynamoDB counts read capacity in half-integers - a short
* eventually-consistent read is counted as 0.5 unit.
* Because we want our counter to be an integer, we counts half units in
* our internal calculations.
*
* We use consumed_capacity_counter for calculation of a specific action
*
* It is also used to update the response if needed.
*/
class consumed_capacity_counter {
public:
consumed_capacity_counter() = default;
consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}
bool operator()() const noexcept {
return _should_add_to_reponse;
}
consumed_capacity_counter& operator +=(uint64_t bytes);
double get_consumed_capacity_units() const noexcept;
void add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept;
virtual ~consumed_capacity_counter() = default;
/**
* \brief get_half_units calculate the half units from the total bytes based on the type of the request
*/
virtual uint64_t get_half_units() const noexcept = 0;
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {
bool _is_quorum = false;
public:
rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);
rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}
virtual uint64_t get_half_units() const noexcept;
static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;
};
class wcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
public:
wcu_consumed_capacity_counter(const rjson::value& request);
static uint64_t get_units(uint64_t total_bytes) noexcept;
};
}

View File

@@ -3,10 +3,12 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/core/with_scheduling_group.hh>
#include <seastar/net/dns.hh>
#include "controller.hh"
#include "server.hh"
#include "executor.hh"
@@ -130,10 +132,10 @@ future<> controller::start_server() {
std::throw_with_nested(std::runtime_error("Failed to set up Alternator TLS credentials"));
}
}
bool alternator_enforce_authorization = _config.alternator_enforce_authorization();
_server.invoke_on_all(
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds), alternator_enforce_authorization] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds, alternator_enforce_authorization,
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds,
_config.alternator_enforce_authorization,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -88,6 +88,9 @@ public:
static api_error table_not_found(std::string msg) {
return api_error("TableNotFoundException", std::move(msg));
}
static api_error limit_exceeded(std::string msg) {
return api_error("LimitExceededException", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);
}

File diff suppressed because it is too large Load Diff

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -23,6 +23,8 @@
#include "utils/rjson.hh"
#include "utils/updateable_value.hh"
#include "tracing/trace_state.hh"
namespace db {
class system_distributed_keyspace;
}
@@ -50,6 +52,8 @@ class gossiper;
}
class schema_builder;
namespace alternator {
class rmw_operation;
@@ -67,7 +71,7 @@ public:
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
*/
json::json_return_type make_streamed(rjson::value&&);
struct json_string : public json::jsonable {
@@ -158,6 +162,7 @@ class executor : public peering_sharded_service<executor> {
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
cdc::metadata& _cdc_metadata;
utils::updateable_value<bool> _enforce_authorization;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -176,10 +181,7 @@ public:
db::system_distributed_keyspace& sdks,
cdc::metadata& cdc_metadata,
smp_service_group ssg,
utils::updateable_value<uint32_t> default_timeout_in_ms)
: _gossiper(gossiper), _proxy(proxy), _mm(mm), _sdks(sdks), _cdc_metadata(cdc_metadata), _ssg(ssg) {
s_default_timeout_in_ms = std::move(default_timeout_in_ms);
}
utils::updateable_value<uint32_t> default_timeout_in_ms);
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -224,7 +226,7 @@ private:
friend class rmw_operation;
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);
@@ -232,18 +234,21 @@ public:
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&);
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
uint64_t* item_length_in_bytes = nullptr,
bool = false);
static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
@@ -265,6 +270,6 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(const service::client_state&, const schema_ptr&, auth::permission);
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "expressions.hh"
@@ -17,12 +17,9 @@
#include "seastarx.hh"
#include <seastar/core/print.hh>
#include <seastar/core/format.hh>
#include <seastar/util/log.hh>
#include <boost/algorithm/cxx11/any_of.hpp>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <functional>
#include <unordered_map>

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
/*
@@ -91,6 +91,18 @@ options {
throw expressions_syntax_error(format("{} at char {}", err,
ex->get_charPositionInLine()));
}
// ANTLR3 tries to recover missing tokens - it tries to finish parsing
// and create valid objects, as if the missing token was there.
// But it has a bug and leaks these tokens.
// We override offending method and handle abandoned pointers.
std::vector<std::unique_ptr<TokenType>> _missing_tokens;
TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
_missing_tokens.emplace_back(token);
return token;
}
}
@lexer::context {
void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -0,0 +1,73 @@
/*
* Copyright 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <string>
#include <string_view>
#include "utils/rjson.hh"
#include "serialization.hh"
#include "column_computation.hh"
#include "db/view/regular_column_transformation.hh"
namespace alternator {
// An implementation of a "column_computation" which extracts a specific
// non-key attribute from the big map (":attrs") of all non-key attributes,
// and deserializes it if it has the desired type. GSI will use this computed
// column as a materialized-view key when the view key attribute isn't a
// full-fledged CQL column but rather stored in ":attrs".
class extract_from_attrs_column_computation : public regular_column_transformation {
// The name of the CQL column name holding the attribute map. It is a
// constant defined in executor.cc (as ":attrs"), so doesn't need
// to be specified when constructing the column computation.
static const bytes MAP_NAME;
// The top-level attribute name to extract from the ":attrs" map.
bytes _attr_name;
// The type we expect for the value stored in the attribute. If the type
// matches the expected type, it is decoded from the serialized format
// we store in the map's values) into the raw CQL type value that we use
// for keys, and returned by compute_value(). Only the types "S" (string),
// "B" (bytes) and "N" (number) are allowed as keys in DynamoDB, and
// therefore in desired_type.
alternator_type _desired_type;
public:
virtual column_computation_ptr clone() const override;
// TYPE_NAME is a unique string that distinguishes this class from other
// column_computation subclasses. column_computation::deserialize() will
// construct an object of this subclass if it sees a "type" TYPE_NAME.
static inline const std::string TYPE_NAME = "alternator_extract_from_attrs";
// Serialize the *definition* of this column computation into a JSON
// string with a unique "type" string - TYPE_NAME - which then causes
// column_computation::deserialize() to create an object from this class.
virtual bytes serialize() const override;
// Construct this object based on the previous output of serialize().
// Calls on_internal_error() if the string doesn't match the output format
// of serialize(). "type" is not checked column_computation::deserialize()
// won't call this constructor if "type" doesn't match.
extract_from_attrs_column_computation(const rjson::value &v);
extract_from_attrs_column_computation(bytes_view attr_name, alternator_type desired_type)
: _attr_name(attr_name), _desired_type(desired_type)
{}
// Implement regular_column_transformation's compute_value() that
// accepts the full row:
result compute_value(const schema& schema, const partition_key& key,
const db::view::clustering_or_static_row& row) const override;
// But do not implement column_computation's compute_value() that
// accepts only a partition key - that's not enough so our implementation
// of this function does on_internal_error().
bytes compute_value(const schema& schema, const partition_key& key) const override;
// This computed column does depend on a non-primary key column, so
// its result may change in the update and we need to compute it
// before and after the update.
virtual bool depends_on_non_primary_key_column() const override {
return true;
}
};
} // namespace alternator

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -11,10 +11,15 @@
#include "seastarx.hh"
#include "service/paxos/cas_request.hh"
#include "utils/rjson.hh"
#include "consumed_capacity.hh"
#include "executor.hh"
#include "tracing/trace_state.hh"
#include "keys.hh"
namespace alternator {
class consumed_capacity;
// An rmw_operation encapsulates the common logic of all the item update
// operations which may involve a read of the item before the write
// (so-called Read-Modify-Write operations). These operations include PutItem,
@@ -63,7 +68,7 @@ protected:
partition_key _pk = partition_key::make_empty();
clustering_key _ck = clustering_key::make_empty();
write_isolation _write_isolation;
mutable wcu_consumed_capacity_counter _consumed_capacity;
// All RMW operations can have a ReturnValues parameter from the following
// choices. But note that only UpdateItem actually supports all of them:
enum class returnvalues {
@@ -113,7 +118,8 @@ public:
tracing::trace_state_ptr trace_state,
service_permit permit,
bool needs_read_before_write,
stats& stats);
stats& stats,
uint64_t& wcu_total);
std::optional<shard_id> shard_for_execute(bool needs_read_before_write);
};

View File

@@ -3,12 +3,12 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "utils/base64.hh"
#include "utils/rjson.hh"
#include "log.hh"
#include "utils/log.hh"
#include "serialization.hh"
#include "error.hh"
#include "concrete_types.hh"
@@ -245,6 +245,27 @@ rjson::value deserialize_item(bytes_view bv) {
return deserialized;
}
// This function takes a bytes_view created earlier by serialize_item(), and
// if has the type "expected_type", the function returns the value as a
// raw Scylla type. If the type doesn't match, returns an unset optional.
// This function only supports the key types S (string), B (bytes) and N
// (number) - serialize_item() serializes those types as a single-byte type
// followed by the serialized raw Scylla type, so all this function needs to
// do is to remove the first byte. This makes this function much more
// efficient than deserialize_item() above because it avoids transformation
// to/from JSON.
std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type) {
if (bv.empty() || alternator_type(bv[0]) != expected_type) {
return std::nullopt;
}
// Currently, serialize_item() for types in alternator_type (notably S, B
// and N) are nothing more than Scylla's raw format for these types
// preceded by a type byte. So we just need to skip that byte and we are
// left by exactly what we need to return.
bv.remove_prefix(1);
return bytes(bv);
}
std::string type_to_string(data_type type) {
static thread_local std::unordered_map<data_type, std::string> types = {
{utf8_type, "S"},

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -43,6 +43,7 @@ type_representation represent_type(alternator_type atype);
bytes serialize_item(const rjson::value& item);
rjson::value deserialize_item(bytes_view bv);
std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type);
std::string type_to_string(data_type type);

View File

@@ -3,11 +3,12 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "alternator/server.hh"
#include "log.hh"
#include "gms/application_state.hh"
#include "utils/log.hh"
#include <fmt/ranges.h>
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
@@ -211,13 +212,33 @@ protected:
// using _gossiper().get_live_members(). But getting
// just the list of live nodes in this DC needs more elaborate code:
auto& topology = _proxy.get_token_metadata_ptr()->get_topology();
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
// /localnodes lists nodes in a single DC. By default the DC of this
// server is used, but it can be overridden by a "dc" query option.
// If the DC does not exist, we return an empty list - not an error.
sstring query_dc = req->get_query_param("dc");
sstring local_dc = query_dc.empty() ? topology.get_datacenter() : query_dc;
std::unordered_set<locator::host_id> local_dc_nodes;
const auto& endpoints = topology.get_datacenter_endpoints();
auto dc_it = endpoints.find(local_dc);
if (dc_it != endpoints.end()) {
local_dc_nodes = dc_it->second;
}
// By default, /localnodes lists the nodes of all racks in the given
// DC, unless a single rack is selected by the "rack" query option.
// If the rack does not exist, we return an empty list - not an error.
sstring query_rack = req->get_query_param("rack");
for (auto& id : local_dc_nodes) {
auto ip = _gossiper.get_address_map().get(id);
if (!query_rack.empty()) {
auto rack = _gossiper.get_application_state_value(ip, gms::application_state::RACK);
if (rack != query_rack) {
continue;
}
}
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We need the node to be in normal state. See #19694.
if (_gossiper.is_normal(ip)) {
// requests. We alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));
@@ -375,7 +396,7 @@ static std::string_view truncated_content_view(const chunked_content& content, s
}
}
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, sstring_view op, const chunked_content& query) {
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
@@ -436,9 +457,16 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace(trace_state, "{}", op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
auto user = client_state.user();
auto f = [this, content = std::move(content), &callback = callback_it->second,
client_state = std::move(client_state), trace_state = std::move(trace_state),
units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
};
co_return co_await _sl_controller.with_user_service_level(user, std::ref(f));
}
void server::set_routes(routes& r) {
@@ -556,9 +584,9 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
_enforce_authorization = std::move(enforce_authorization);
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -39,7 +39,7 @@ class server {
qos::service_level_controller& _sl_controller;
key_cache _key_cache;
bool _enforce_authorization;
utils::updateable_value<bool> _enforce_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
@@ -76,7 +76,7 @@ public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "stats.hh"
@@ -29,8 +29,6 @@ stats::stats() : api_operations{} {
seastar::metrics::description("Latency summary of an operation via Alternator API"), [this]{return to_metrics_summary(api_operations.name.summary());})(op(CamelCaseName)).set_skip_when_empty(),
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(batch_get_item_batch_total, "BatchGetItemSize")
OPERATION(batch_write_item_batch_total, "BatchWriteItemSize")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
OPERATION(create_table, "CreateTable")
@@ -96,8 +94,22 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,
seastar::metrics::description("number of rows read and matched during filtering operations")),
seastar::metrics::make_counter("rcu_total", [this]{return 0.5 * rcu_half_units_total;},
seastar::metrics::description("total number of consumed read units")).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::PUT_ITEM],
seastar::metrics::description("total number of consumed write units"),{op("PutItem")}).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::DELETE_ITEM],
seastar::metrics::description("total number of consumed write units"),{op("DeleteItem")}).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::UPDATE_ITEM],
seastar::metrics::description("total number of consumed write units"),{op("UpdateItem")}).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::INDEX],
seastar::metrics::description("total number of consumed write units"),{op("Index")}).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations")),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
api_operations.batch_write_item_batch_total).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
api_operations.batch_get_item_batch_total).set_skip_when_empty(),
});
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -67,6 +67,7 @@ public:
uint64_t get_shard_iterator = 0;
uint64_t get_records = 0;
utils::timed_rate_moving_average_summary_and_histogram put_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram delete_item_latency;
@@ -83,6 +84,18 @@ public:
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
uint64_t rcu_half_units_total = 0;
// wcu can results from put, update, delete and index
// Index related will be done on top of the operation it comes with
enum wcu_types {
PUT_ITEM,
UPDATE_ITEM,
DELETE_ITEM,
INDEX,
NUM_TYPES
};
uint64_t wcu_total[NUM_TYPES] = {0};
// CQL-derived stats
cql3::cql_stats cql_stats;
private:

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <type_traits>
@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
if (limit > 1000) {
throw api_error::validation("Limit must be less than or equal to 1000");
}
auto db = _proxy.data_dictionary();
schema_ptr schema, base;
@@ -824,7 +827,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
co_await verify_permission(client_state, schema, auth::permission::SELECT);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
partition_key pk = iter.shard.id.to_partition_key(*schema);
@@ -844,19 +847,21 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
std::optional<attrs_to_get> key_names = boost::copy_range<attrs_to_get>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
| boost::adaptors::transformed([&] (const column_definition& cdef) {
std::optional<attrs_to_get> key_names =
base->primary_key_columns()
| std::views::transform([&] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
| std::ranges::to<attrs_to_get>()
;
// Include all base table columns as values (in case pre or post is enabled).
// This will include attributes not stored in the frozen map column
std::optional<attrs_to_get> attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
std::optional<attrs_to_get> attr_names = base->regular_columns()
// this will include the :attrs column, which we will also force evaluating.
// But not having this set empty forces out any cdc columns from actual result
| boost::adaptors::transformed([] (const column_definition& cdef) {
| std::views::transform([] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
| std::ranges::to<attrs_to_get>()
;
std::vector<const column_definition*> columns;
columns.reserve(schema->all_columns().size());
@@ -867,10 +872,11 @@ future<executor::request_return_type> executor::get_records(client_state& client
std::transform(pks.begin(), pks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });
auto regular_columns = boost::copy_range<query::column_id_vector>(schema->regular_columns()
| boost::adaptors::filtered([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })
| boost::adaptors::transformed([&] (const column_definition& cdef) { columns.emplace_back(&cdef); return cdef.id; })
);
auto regular_columns = schema->regular_columns()
| std::views::filter([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })
| std::views::transform([&] (const column_definition& cdef) { columns.emplace_back(&cdef); return cdef.id; })
| std::ranges::to<query::column_id_vector>()
;
stream_view_type type = cdc_options_to_steam_view_type(base->cdc_options());
@@ -978,7 +984,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, true);
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <chrono>
@@ -23,7 +23,7 @@
#include "gms/inet_address.hh"
#include "inet_address_vectors.hh"
#include "locator/abstract_replication_strategy.hh"
#include "log.hh"
#include "utils/log.hh"
#include "gc_clock.hh"
#include "replica/database.hh"
#include "service/client_state.hh"
@@ -99,7 +99,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
co_await verify_permission(client_state, schema, auth::permission::ALTER);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
@@ -315,19 +315,19 @@ static size_t random_offset(size_t min, size_t max) {
// this range's primary node is down. For this we need to return not just
// a list of this node's secondary ranges - but also the primary owner of
// each of those ranges.
static future<std::vector<std::pair<dht::token_range, gms::inet_address>>> get_secondary_ranges(
static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_secondary_ranges(
const locator::effective_replication_map_ptr& erm,
gms::inet_address ep) {
locator::host_id ep) {
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
std::vector<std::pair<dht::token_range, gms::inet_address>> ret;
std::vector<std::pair<dht::token_range, locator::host_id>> ret;
if (sorted_tokens.empty()) {
on_internal_error(tlogger, "Token metadata is empty");
}
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
inet_address_vector_replica_set eps = erm->get_natural_endpoints(tok);
host_id_vector_replica_set eps = erm->get_natural_replicas(tok);
if (eps.size() <= 1 || eps[1] != ep) {
prev_tok = tok;
continue;
@@ -396,7 +396,7 @@ class ranges_holder_primary {
dht::token_range_vector _token_ranges;
public:
explicit ranges_holder_primary(dht::token_range_vector token_ranges) : _token_ranges(std::move(token_ranges)) {}
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, gms::inet_address ep) {
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, locator::host_id ep) {
co_return ranges_holder_primary(co_await erm->get_primary_ranges(ep));
}
std::size_t size() const { return _token_ranges.size(); }
@@ -410,13 +410,13 @@ public:
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
std::vector<std::pair<dht::token_range, locator::host_id>> _token_ranges;
const gms::gossiper& _gossiper;
public:
explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, gms::inet_address>> token_ranges, const gms::gossiper& g)
explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, locator::host_id>> token_ranges, const gms::gossiper& g)
: _token_ranges(std::move(token_ranges))
, _gossiper(g) {}
static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, gms::inet_address ep, const gms::gossiper& g) {
static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, locator::host_id ep, const gms::gossiper& g) {
co_return ranges_holder_secondary(co_await get_secondary_ranges(erm, ep), g);
}
std::size_t size() const { return _token_ranges.size(); }
@@ -521,8 +521,9 @@ struct scan_ranges_context {
// be good if we can read only the single item of the map - it
// should be possible (and a must for issue #7751!).
lw_shared_ptr<service::pager::paging_state> paging_state = nullptr;
auto regular_columns = boost::copy_range<query::column_id_vector>(
s->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto regular_columns =
s->regular_columns() | std::views::transform([] (const column_definition& cdef) { return cdef.id; })
| std::ranges::to<query::column_id_vector>();
selection = cql3::selection::selection::wildcard(s);
query::partition_slice::option_set opts = selection->get_query_options();
opts.set<query::partition_slice::option::allow_short_read>();
@@ -730,8 +731,8 @@ static future<bool> scan_table(
// FIXME: need to pace the scan, not do it all at once.
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();
auto my_address = erm->get_topology().my_address();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_address));
auto my_host_id = erm->get_topology().my_host_id();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
@@ -751,7 +752,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_address, gossiper));
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -1,4 +1,29 @@
# Generate C++ sources from Swagger definitions
function(generate_swagger)
set(one_value_args TARGET VAR IN_FILE OUT_DIR)
cmake_parse_arguments(args "" "${one_value_args}" "" ${ARGN})
get_filename_component(in_file_name ${args_IN_FILE} NAME)
set(generator ${PROJECT_SOURCE_DIR}/seastar/scripts/seastar-json2code.py)
set(header_out ${args_OUT_DIR}/${in_file_name}.hh)
set(source_out ${args_OUT_DIR}/${in_file_name}.cc)
add_custom_command(
DEPENDS
${args_IN_FILE}
${generator}
OUTPUT ${header_out} ${source_out}
COMMAND ${CMAKE_COMMAND} -E make_directory ${args_OUT_DIR}
COMMAND ${generator} --create-cc -f ${args_IN_FILE} -o ${header_out})
add_custom_target(${args_TARGET}
DEPENDS
${header_out}
${source_out})
set(${args_VAR} ${header_out} ${source_out} PARENT_SCOPE)
endfunction()
set(swagger_files
api-doc/authorization_cache.json
api-doc/cache_service.json
@@ -17,6 +42,7 @@ set(swagger_files
api-doc/messaging_service.json
api-doc/metrics.json
api-doc/raft.json
api-doc/service_levels.json
api-doc/storage_proxy.json
api-doc/storage_service.json
api-doc/stream_manager.json
@@ -29,7 +55,7 @@ set(swagger_files
foreach(f ${swagger_files})
get_filename_component(fname "${f}" NAME_WE)
get_filename_component(dir "${f}" DIRECTORY)
seastar_generate_swagger(
generate_swagger(
TARGET scylla_swagger_gen_${fname}
VAR scylla_swagger_gen_${fname}_files
IN_FILE "${CMAKE_CURRENT_SOURCE_DIR}/${f}"
@@ -37,7 +63,7 @@ foreach(f ${swagger_files})
list(APPEND swagger_gen_files "${scylla_swagger_gen_${fname}_files}")
endforeach()
add_library(api)
add_library(api STATIC)
target_sources(api
PRIVATE
api.cc
@@ -57,6 +83,7 @@ target_sources(api
lsa.cc
messaging_service.cc
raft.cc
service_levels.cc
storage_proxy.cc
storage_service.cc
stream_manager.cc
@@ -71,11 +98,13 @@ target_include_directories(api
${CMAKE_SOURCE_DIR}
${scylla_gen_build_dir})
target_link_libraries(api
idl
wasmtime_bindings
Seastar::seastar
xxHash::xxhash
absl::headers)
PUBLIC
Seastar::seastar
xxHash::xxhash
PRIVATE
idl
wasmtime_bindings
absl::headers)
check_headers(check-headers api
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -94,6 +94,38 @@
]
}
]
},
{
"path":"/raft/trigger_stepdown/",
"operations":[
{
"method":"POST",
"summary":"Triggers stepdown of a leader for given Raft group or group0 if not provided (returns an error if the node is not a leader)",
"type":"string",
"nickname":"trigger_stepdown",
"produces":[
"application/json"
],
"parameters":[
{
"name":"group_id",
"description":"The ID of the group which leader should stepdown",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -0,0 +1,56 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/service_levels",
"produces":[
"application/json"
],
"apis":[
{
"path":"/service_levels/switch_tenants",
"operations":[
{
"method":"POST",
"summary":"Switch tenants on all opened connections if needed",
"type":"void",
"nickname":"do_switch_tenants",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/service_levels/count_connections",
"operations":[
{
"method":"GET",
"summary":"Count opened CQL connections per scheduling group per user",
"type":"connections_count_map",
"nickname":"count_connections",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
],
"models":{},
"components": {
"schemas": {
"connections_count_map": {
"type": "object",
"additionalProperties": {
"type": "object",
"additionalProperties": {
"type": "integer"
}
}
}
}
}
}

View File

@@ -782,6 +782,14 @@
"type":"string",
"paramType":"query"
},
{
"name":"prefix",
"description":"The prefix of the objects for the backuped sstables",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy sstables from",
@@ -790,6 +798,14 @@
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy sstables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy sstables from",
@@ -831,13 +847,25 @@
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy SSTables from",
"name":"prefix",
"description":"The prefix of the object keys for the backuped SSTables",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"in": "body",
"name": "sstables",
"description": "The list of the object keys of the TOC component of the SSTables to be restored",
"required":true,
"schema" :{
"type": "array",
"items": {
"type": "string"
}
}
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
@@ -849,10 +877,19 @@
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":false,
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"scope",
"description":"Defines the set of nodes to which mutations can be streamed",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query",
"enum": ["all", "dc", "rack", "node"]
}
]
}
@@ -939,7 +976,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -949,6 +986,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -1611,38 +1672,6 @@
}
]
},
{
"path":"/storage_service/truncate/{keyspace}",
"operations":[
{
"method":"POST",
"summary":"Truncates (deletes) the given columnFamily from the provided keyspace. Calling truncate results in actual deletion of all data in the cluster under the given columnFamily and it will fail unless all hosts are up. All data in the given column family will be deleted, but its definition will not be affected.",
"type":"void",
"nickname":"truncate",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Column family name",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/keyspaces",
"operations":[
@@ -2824,6 +2853,70 @@
}
]
},
{
"path":"/storage_service/tablets/repair",
"operations":[
{
"nickname":"repair_tablet",
"method":"POST",
"summary":"Repair a tablet",
"type":"tablet_repair_result",
"produces":[
"application/json"
],
"parameters":[
{
"name":"ks",
"description":"Keyspace name to repair",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table name to repair",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"tokens",
"description":"Tokens owned by the tablets to repair. Multiple tokens can be provided using a comma-separated list. When set to the special word 'all', all tablets will be repaired",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"hosts_filter",
"description":"Repair replicas listed in the comma-separated host_id list.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"dcs_filter",
"description":"Repair replicas listed in the comma-separated DC list",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"await_completion",
"description":"Set true to wait for the repair to complete. Set false to skip waiting for the repair to complete. When the option is not provided, it defaults to false.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/tablets/balancing",
"operations":[
@@ -2992,6 +3085,22 @@
]
}
]
},
{
"path":"/storage_service/raft_topology/cmd_rpc_status",
"operations":[
{
"method":"GET",
"summary":"Get information about currently running topology cmd rpc",
"type":"string",
"nickname":"raft_topology_get_cmd_status",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{
@@ -3128,11 +3237,11 @@
"properties":{
"start_token":{
"type":"string",
"description":"The range start token"
"description":"The range start token (exclusive)"
},
"end_token":{
"type":"string",
"description":"The range start token"
"description":"The range end token (inclusive)"
},
"endpoints":{
"type":"array",
@@ -3242,6 +3351,15 @@
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",
"properties":{
"tablet_task_id":{
"type":"string"
}
}
}
}
}

View File

@@ -11,8 +11,8 @@
"url": "http://scylladb.com"
},
"license": {
"name": "AGPL",
"url": "https://github.com/scylladb/scylla/blob/master/LICENSE.AGPL"
"name": "ScyllaDB-Source-Available-1.0",
"url": "https://github.com/scylladb/scylla/blob/master/LICENSE-ScyllaDB-Source-Available.md"
}
},
"host": "{{Host}}",

View File

@@ -198,7 +198,7 @@
"parameters":[
{
"name":"ttl",
"description":"The number of seconds for which the tasks will be kept in memory after it finishes",
"description":"The number of seconds for which the task started internally will be kept in memory after it finishes",
"required":true,
"allowMultiple":false,
"type":"long",
@@ -218,6 +218,65 @@
]
}
]
},
{
"path":"/task_manager/user_ttl",
"operations":[
{
"method":"POST",
"summary":"Set user task ttl in seconds and get last value",
"type":"long",
"nickname":"get_and_update_user_ttl",
"produces":[
"application/json"
],
"parameters":[
{
"name":"user_ttl",
"description":"The number of seconds for which the task started by user will be kept in memory after it finishes",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Get current user task ttl value",
"type":"long",
"nickname":"get_user_ttl",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/drain/{module}",
"operations":[
{
"method":"POST",
"summary":"Drain finished local tasks",
"type":"void",
"nickname":"drain_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to drain",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
@@ -249,7 +308,8 @@
"created",
"running",
"done",
"failed"
"failed",
"suspended"
],
"description":"The state of a task"
},
@@ -317,7 +377,8 @@
"created",
"running",
"done",
"failed"
"failed",
"suspended"
],
"description":"The state of the task"
},

View File

@@ -93,6 +93,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"user_task",
"description":"A flag indicating whether a task was started by user (false by default)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},

View File

@@ -42,6 +42,14 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "api.hh"
@@ -35,6 +35,8 @@
#include "task_manager_test.hh"
#include "tasks.hh"
#include "raft.hh"
#include "gms/gossip_address_map.hh"
#include "service_levels.hh"
logging::logger apilog("api");
@@ -79,7 +81,7 @@ future<> set_server_init(http_context& ctx) {
});
}
future<> set_server_config(http_context& ctx, const db::config& cfg) {
future<> set_server_config(http_context& ctx, db::config& cfg) {
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([&ctx, &cfg, rb02](routes& r) {
set_config(rb02, ctx, r, cfg, false);
@@ -135,6 +137,14 @@ future<> unset_load_meter(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_load_meter(ctx, r); });
}
future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel) {
return ctx.http_server.set_routes([&ctx, &sel] (routes& r) { set_format_selector(ctx, r, sel); });
}
future<> unset_format_selector(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_format_selector(ctx, r); });
}
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {
return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });
}
@@ -143,16 +153,16 @@ future<> unset_server_sstables_loader(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_sstables_loader(ctx, r); });
}
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb) {
return ctx.http_server.set_routes([&ctx, &vb] (routes& r) { set_view_builder(ctx, r, vb); });
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {
return ctx.http_server.set_routes([&ctx, &vb, &g] (routes& r) { set_view_builder(ctx, r, vb, g); });
}
future<> unset_server_view_builder(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_view_builder(ctx, r); });
}
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair) {
return ctx.http_server.set_routes([&ctx, &repair] (routes& r) { set_repair(ctx, r, repair); });
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair, sharded<gms::gossip_address_map>& am) {
return ctx.http_server.set_routes([&ctx, &repair, &am] (routes& r) { set_repair(ctx, r, repair, am); });
}
future<> unset_server_repair(http_context& ctx) {
@@ -178,8 +188,8 @@ future<> unset_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });
}
future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm) {
return ctx.http_server.set_routes([&ctx, &tm] (routes& r) { set_token_metadata(ctx, r, tm); });
future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm, sharded<gms::gossiper>& g) {
return ctx.http_server.set_routes([&ctx, &tm, &g] (routes& r) { set_token_metadata(ctx, r, tm, g); });
}
future<> unset_server_token_metadata(http_context& ctx) {
@@ -263,10 +273,10 @@ future<> unset_server_cache(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_cache_service(ctx, r); });
}
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy) {
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy, sharded<gms::gossiper>& g) {
return register_api(ctx, "hinted_handoff",
"The hinted handoff API", [&proxy] (http_context& ctx, routes& r) {
set_hinted_handoff(ctx, r, proxy);
"The hinted handoff API", [&proxy, &g] (http_context& ctx, routes& r) {
set_hinted_handoff(ctx, r, proxy, g);
});
}
@@ -274,16 +284,16 @@ future<> unset_hinted_handoff(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_hinted_handoff(ctx, r); });
}
future<> set_server_compaction_manager(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
rb->register_function(r, "compaction_manager",
"The Compaction manager API");
set_compaction_manager(ctx, r);
future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm) {
return register_api(ctx, "compaction_manager", "The Compaction manager API", [&cm] (http_context& ctx, routes& r) {
set_compaction_manager(ctx, r, cm);
});
}
future<> unset_server_compaction_manager(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_compaction_manager(ctx, r); });
}
future<> set_server_done(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
@@ -291,15 +301,22 @@ future<> set_server_done(http_context& ctx) {
rb->register_function(r, "lsa", "Log-structured allocator API");
set_lsa(ctx, r);
rb->register_function(r, "commitlog",
"The commit log API");
set_commitlog(ctx,r);
rb->register_function(r, "collectd",
"The collectd API");
set_collectd(ctx, r);
});
}
future<> set_server_commitlog(http_context& ctx, sharded<replica::database>& db) {
return register_api(ctx, "commitlog", "The commit log API", [&db] (http_context& ctx, routes& r) {
set_commitlog(ctx, r, db);
});
}
future<> unset_server_commitlog(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_commitlog(ctx, r); });
}
future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
@@ -342,6 +359,12 @@ future<> unset_server_cql_server_test(http_context& ctx) {
#endif
future<> set_server_service_levels(http_context &ctx, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp) {
return register_api(ctx, "service_levels", "The service levels API", [&ctl, &qp] (http_context& ctx, routes& r) {
set_service_levels(ctx, r, ctl, qp);
});
}
future<> set_server_tasks_compaction_module(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -219,11 +219,10 @@ template <class T, class Base = T>
class req_param {
public:
sstring name;
sstring param;
T value;
req_param(const request& req, sstring name, T default_val) : name(name) {
param = req.get_query_param(name);
sstring param = req.get_query_param(name);
if (param.empty()) {
value = default_val;
return;

View File

@@ -3,13 +3,14 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/http/httpd.hh>
#include <seastar/core/future.hh>
#include "gms/gossip_address_map.hh"
#include "replica/database_fwd.hh"
#include "tasks/task_manager.hh"
#include "seastarx.hh"
@@ -17,6 +18,8 @@
using request = http::request;
using reply = http::reply;
class compaction_manager;
namespace service {
class load_meter;
@@ -49,6 +52,7 @@ namespace cql_transport { class controller; }
namespace db {
class snapshot_ctl;
class config;
class sstables_format_selector;
namespace view {
class view_builder;
}
@@ -69,6 +73,10 @@ namespace tasks {
class task_manager;
}
namespace cql3 {
class query_processor;
}
namespace api {
struct http_context {
@@ -84,7 +92,7 @@ struct http_context {
};
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx, const db::config& cfg);
future<> set_server_config(http_context& ctx, db::config& cfg);
future<> unset_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
@@ -92,9 +100,9 @@ future<> set_server_storage_service(http_context& ctx, sharded<service::storage_
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);
future<> unset_server_view_builder(http_context& ctx);
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair);
future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair, sharded<gms::gossip_address_map>& am);
future<> unset_server_repair(http_context& ctx);
future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl);
future<> unset_transport_controller(http_context& ctx);
@@ -104,7 +112,7 @@ future<> set_server_authorization_cache(http_context& ctx, sharded<auth::service
future<> unset_server_authorization_cache(http_context& ctx);
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);
future<> unset_server_snapshot(http_context& ctx);
future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm);
future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm, sharded<gms::gossiper>& g);
future<> unset_server_token_metadata(http_context& ctx);
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);
future<> unset_server_gossip(http_context& ctx);
@@ -116,11 +124,12 @@ future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_pr
future<> unset_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm);
future<> unset_server_stream_manager(http_context& ctx);
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p);
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p, sharded<gms::gossiper>& g);
future<> unset_hinted_handoff(http_context& ctx);
future<> set_server_cache(http_context& ctx);
future<> unset_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm);
future<> unset_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg);
future<> unset_server_task_manager(http_context& ctx);
@@ -132,7 +141,12 @@ future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);
future<> unset_server_raft(http_context&);
future<> set_load_meter(http_context& ctx, service::load_meter& lm);
future<> unset_load_meter(http_context& ctx);
future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel);
future<> unset_format_selector(http_context& ctx);
future<> set_server_cql_server_test(http_context& ctx, cql_transport::controller& ctl);
future<> unset_server_cql_server_test(http_context& ctx);
future<> set_server_service_levels(http_context& ctx, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp);
future<> set_server_commitlog(http_context& ctx, sharded<replica::database>&);
future<> unset_server_commitlog(http_context& ctx);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "api/api-doc/authorization_cache.json.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "cache_service.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "collectd.hh"
@@ -11,6 +11,7 @@
#include <seastar/core/scollectd.hh>
#include <seastar/core/scollectd_api.hh>
#include <boost/range/irange.hpp>
#include <ranges>
#include <regex>
#include "api/api_init.hh"
@@ -61,7 +62,7 @@ void set_collectd(http_context& ctx, routes& r) {
return do_with(std::vector<cd::collectd_value>(), [id] (auto& vec) {
vec.resize(smp::count);
return parallel_for_each(boost::irange(0u, smp::count), [&vec, id] (auto cpu) {
return parallel_for_each(std::views::iota(0u, smp::count), [&vec, id] (auto cpu) {
return smp::submit_to(cpu, [id = *id] {
return scollectd::get_collectd_value(id);
}).then([&vec, cpu] (auto res) {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <fmt/ranges.h>
@@ -24,6 +24,9 @@
#include "compaction/compaction_manager.hh"
#include "unimplemented.hh"
#include <boost/range/algorithm/copy.hpp>
#include <boost/range/numeric.hpp>
extern logging::logger apilog;
namespace api {
@@ -61,14 +64,6 @@ table_id get_uuid(const sstring& name, const replica::database& db) {
return get_uuid(ks, cf, db);
}
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.invoke_on_all([f, uuid](replica::database& db) {
f(db.find_column_family(uuid));
});
}
future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& name,
int64_t replica::column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {
@@ -83,7 +78,7 @@ future<json::json_return_type> get_cf_stats(http_context& ctx,
}, std::plus<int64_t>());
}
static future<json::json_return_type> set_tables(http_context& ctx, const sstring& keyspace, std::vector<sstring> tables, std::function<future<>(replica::table&)> set) {
static future<json::json_return_type> for_tables_on_all_shards(http_context& ctx, const sstring& keyspace, std::vector<sstring> tables, std::function<future<>(replica::table&)> set) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
@@ -123,7 +118,7 @@ static future<json::json_return_type> set_tables_autocompaction(http_context& ct
return ctx.db.invoke_on(0, [&ctx, keyspace, tables = std::move(tables), enabled] (replica::database& db) {
auto g = autocompaction_toggle_guard(db);
return set_tables(ctx, keyspace, tables, [enabled] (replica::table& cf) {
return for_tables_on_all_shards(ctx, keyspace, tables, [enabled] (replica::table& cf) {
if (enabled) {
cf.enable_auto_compaction();
} else {
@@ -136,7 +131,7 @@ static future<json::json_return_type> set_tables_autocompaction(http_context& ct
static future<json::json_return_type> set_tables_tombstone_gc(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
apilog.info("set_tables_tombstone_gc: enabled={} keyspace={} tables={}", enabled, keyspace, tables);
return set_tables(ctx, keyspace, std::move(tables), [enabled] (replica::table& t) {
return for_tables_on_all_shards(ctx, keyspace, std::move(tables), [enabled] (replica::table& t) {
t.set_tombstone_gc_enabled(enabled);
return make_ready_future<>();
});
@@ -206,8 +201,8 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:
};
return ctx.db.map(fun).then([](const std::vector<utils::ihistogram> &res) {
std::vector<httpd::utils_json::histogram> r;
boost::copy(res | boost::adaptors::transformed(to_json), std::back_inserter(r));
return make_ready_future<json::json_return_type>(r);
std::ranges::copy(res | std::views::transform(to_json), std::back_inserter(r));
return make_ready_future<json::json_return_type>(std::move(r));
});
}
@@ -233,7 +228,7 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ct
};
return ctx.db.map(fun).then([](const std::vector<utils::rate_moving_average_and_histogram> &res) {
std::vector<httpd::utils_json::rate_moving_average_and_histogram> r;
boost::copy(res | boost::adaptors::transformed(timer_to_json), std::back_inserter(r));
std::ranges::copy(res | std::views::transform(timer_to_json), std::back_inserter(r));
return make_ready_future<json::json_return_type>(r);
});
}
@@ -723,25 +718,25 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
return std::ranges::fold_left(*cf.get_sstables() | std::views::transform(filter_false_positive_as_ratio_holder), ratio_holder(), std::plus{});
}, std::plus<>());
});
cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
return std::ranges::fold_left(*cf.get_sstables() | std::views::transform(filter_false_positive_as_ratio_holder), ratio_holder(), std::plus{});
}, std::plus<>());
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
return std::ranges::fold_left(*cf.get_sstables() | std::views::transform(filter_recent_false_positive_as_ratio_holder), ratio_holder(), std::plus{});
}, std::plus<>());
});
cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
return std::ranges::fold_left(*cf.get_sstables() | std::views::transform(filter_recent_false_positive_as_ratio_holder), ratio_holder(), std::plus{});
}, std::plus<>());
});
@@ -998,11 +993,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
auto ks_cf = parse_fully_qualified_cf_name(req->get_path_param("name"));
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
return sys_ks.local().load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace_view_build_progress>& vb) mutable {
// Use of load_built_views() as filtering table should be in sync with
// built_indexes_virtual_reader filtering with BUILT_VIEWS table
return sys_ks.local().load_built_views().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_name>& vb) mutable {
std::set<sstring> vp;
for (auto b : vb) {
if (b.view.first == ks) {
vp.insert(b.view.second);
if (b.first == ks) {
vp.insert(b.second);
}
}
std::vector<sstring> res;
@@ -1010,7 +1007,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
res.reserve(cf.get_index_manager().list_indexes().size());
for (auto&& i : cf.get_index_manager().list_indexes()) {
if (!vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
if (vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
res.emplace_back(i.metadata().name());
}
}
@@ -1056,12 +1053,12 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
sstring strategy = req->get_query_param("class_name");
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->get_path_param("name"), strategy);
return foreach_column_family(ctx, req->get_path_param("name"), [strategy](replica::column_family& cf) {
return for_tables_on_all_shards(ctx, ks, {std::move(cf)}, [strategy] (replica::table& cf) {
cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
return make_ready_future<>();
});
});
@@ -1095,7 +1092,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
co_return boost::copy_range<std::unordered_set<sstring>>(sstables | boost::adaptors::transformed([] (auto s) { return s->get_filename(); }));
co_return sstables | std::views::transform([] (auto s) { return s->get_filename(); }) | std::ranges::to<std::unordered_set>();
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.merge(b);
@@ -1115,7 +1112,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
name, duration.param, list_size.param, capacity.param);
name, duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx, true);

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -23,8 +23,6 @@ void set_column_family(http_context& ctx, httpd::routes& r, sharded<db::system_k
void unset_column_family(http_context& ctx, httpd::routes& r);
table_id get_uuid(const sstring& name, const replica::database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

View File

@@ -3,24 +3,26 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "commitlog.hh"
#include "db/commitlog/commitlog.hh"
#include "api/api-doc/commitlog.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "api/api_init.hh"
#include "replica/database.hh"
#include <vector>
namespace api {
using namespace seastar::httpd;
namespace ss = httpd::storage_service_json;
template<typename T>
static auto acquire_cl_metric(http_context& ctx, std::function<T (const db::commitlog*)> func) {
static auto acquire_cl_metric(sharded<replica::database>& db, std::function<T (const db::commitlog*)> func) {
typedef T ret_type;
return ctx.db.map_reduce0([func = std::move(func)](replica::database& db) {
return db.map_reduce0([func = std::move(func)](replica::database& db) {
if (db.commitlog() == nullptr) {
return make_ready_future<ret_type>();
}
@@ -30,11 +32,11 @@ static auto acquire_cl_metric(http_context& ctx, std::function<T (const db::comm
});
}
void set_commitlog(http_context& ctx, routes& r) {
void set_commitlog(http_context& ctx, routes& r, sharded<replica::database>& db) {
httpd::commitlog_json::get_active_segment_names.set(r,
[&ctx](std::unique_ptr<request> req) {
[&db](std::unique_ptr<request> req) {
auto res = make_shared<std::vector<sstring>>();
return ctx.db.map_reduce([res](std::vector<sstring> names) {
return db.map_reduce([res](std::vector<sstring> names) {
res->insert(res->end(), names.begin(), names.end());
}, [](replica::database& db) {
if (db.commitlog() == nullptr) {
@@ -52,20 +54,35 @@ void set_commitlog(http_context& ctx, routes& r) {
return res;
});
httpd::commitlog_json::get_completed_tasks.set(r, [&ctx](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(ctx, std::bind(&db::commitlog::get_completed_tasks, std::placeholders::_1));
httpd::commitlog_json::get_completed_tasks.set(r, [&db](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(db, std::bind(&db::commitlog::get_completed_tasks, std::placeholders::_1));
});
httpd::commitlog_json::get_pending_tasks.set(r, [&ctx](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(ctx, std::bind(&db::commitlog::get_pending_tasks, std::placeholders::_1));
httpd::commitlog_json::get_pending_tasks.set(r, [&db](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(db, std::bind(&db::commitlog::get_pending_tasks, std::placeholders::_1));
});
httpd::commitlog_json::get_total_commit_log_size.set(r, [&ctx](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(ctx, std::bind(&db::commitlog::get_total_size, std::placeholders::_1));
httpd::commitlog_json::get_total_commit_log_size.set(r, [&db](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(db, std::bind(&db::commitlog::get_total_size, std::placeholders::_1));
});
httpd::commitlog_json::get_max_disk_size.set(r, [&ctx](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(ctx, std::bind(&db::commitlog::disk_limit, std::placeholders::_1));
httpd::commitlog_json::get_max_disk_size.set(r, [&db](std::unique_ptr<request> req) {
return acquire_cl_metric<uint64_t>(db, std::bind(&db::commitlog::disk_limit, std::placeholders::_1));
});
ss::get_commitlog.set(r, [&db](const_req req) {
return db.local().commitlog()->active_config().commit_log_location;
});
}
void unset_commitlog(http_context& ctx, routes& r) {
httpd::commitlog_json::get_active_segment_names.unset(r);
httpd::commitlog_json::get_archiving_segment_names.unset(r);
httpd::commitlog_json::get_completed_tasks.unset(r);
httpd::commitlog_json::get_pending_tasks.unset(r);
httpd::commitlog_json::get_total_commit_log_size.unset(r);
httpd::commitlog_json::get_max_disk_size.unset(r);
ss::get_commitlog.unset(r);
}
}

View File

@@ -3,17 +3,22 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/sharded.hh>
namespace seastar::httpd {
class routes;
}
namespace replica { class database; }
namespace api {
struct http_context;
void set_commitlog(http_context& ctx, seastar::httpd::routes& r);
void set_commitlog(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<replica::database>&);
void unset_commitlog(http_context& ctx, seastar::httpd::routes& r);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/core/coroutine.hh>
@@ -13,6 +13,7 @@
#include "compaction/compaction_manager.hh"
#include "api/api.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
#include "unimplemented.hh"
@@ -23,13 +24,14 @@
namespace api {
namespace cm = httpd::compaction_manager_json;
namespace ss = httpd::storage_service_json;
using namespace json;
using namespace seastar::httpd;
static future<json::json_return_type> get_cm_stats(http_context& ctx,
static future<json::json_return_type> get_cm_stats(sharded<compaction_manager>& cm,
int64_t compaction_manager::stats::*f) {
return ctx.db.map_reduce0([f](replica::database& db) {
return db.get_compaction_manager().get_stats().*f;
return cm.map_reduce0([f](compaction_manager& cm) {
return cm.get_stats().*f;
}, int64_t(0), std::plus<int64_t>()).then([](const int64_t& res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -44,11 +46,10 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha
return std::move(a);
}
void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_compactions.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return ctx.db.map_reduce0([](replica::database& db) {
void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_manager>& cm) {
cm::get_compactions.set(r, [&cm] (std::unique_ptr<http::request> req) {
return cm.map_reduce0([](compaction_manager& cm) {
std::vector<cm::summary> summaries;
const compaction_manager& cm = db.get_compaction_manager();
for (const auto& c : cm.get_compactions()) {
cm::summary s;
@@ -100,10 +101,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(json_void());
});
cm::stop_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) {
cm::stop_compaction.set(r, [&cm] (std::unique_ptr<http::request> req) {
auto type = req->get_query_param("type");
return ctx.db.invoke_on_all([type] (replica::database& db) {
auto& cm = db.get_compaction_manager();
return cm.invoke_on_all([type] (compaction_manager& cm) {
return cm.stop_compaction(type);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -113,9 +113,6 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req);
auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");
if (table_names.empty()) {
table_names = map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
}
auto type = req->get_query_param("type");
co_await ctx.db.invoke_on_all([&] (replica::database& db) {
auto& cm = db.get_compaction_manager();
@@ -135,8 +132,8 @@ void set_compaction_manager(http_context& ctx, routes& r) {
}, std::plus<int64_t>());
});
cm::get_completed_tasks.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cm_stats(ctx, &compaction_manager::stats::completed_tasks);
cm::get_completed_tasks.set(r, [&cm] (std::unique_ptr<http::request> req) {
return get_cm_stats(cm, &compaction_manager::stats::completed_tasks);
});
cm::get_total_compactions_completed.set(r, [] (std::unique_ptr<http::request> req) {
@@ -153,14 +150,14 @@ void set_compaction_manager(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cm::get_compaction_history.set(r, [&ctx] (std::unique_ptr<http::request> req) {
std::function<future<>(output_stream<char>&&)> f = [&ctx] (output_stream<char>&& out) -> future<> {
cm::get_compaction_history.set(r, [&cm] (std::unique_ptr<http::request> req) {
std::function<future<>(output_stream<char>&&)> f = [&cm] (output_stream<char>&& out) -> future<> {
auto s = std::move(out);
bool first = true;
std::exception_ptr ex;
try {
co_await s.write("[");
co_await ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
co_await cm.local().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
cm::history h;
h.id = fmt::to_string(entry.id);
h.ks = std::move(entry.ks);
@@ -168,7 +165,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {
h.compacted_at = entry.compacted_at;
h.bytes_in = entry.bytes_in;
h.bytes_out = entry.bytes_out;
for (auto it : entry.rows_merged) {
std::map<int32_t, int64_t> items(entry.rows_merged.begin(), entry.rows_merged.end());
for (auto it : items) {
httpd::compaction_manager_json::row_merged e;
e.key = it.first;
e.value = it.second;
@@ -201,6 +200,25 @@ void set_compaction_manager(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
ss::get_compaction_throughput_mb_per_sec.set(r, [&cm](std::unique_ptr<http::request> req) {
int value = cm.local().throughput_mbs();
return make_ready_future<json::json_return_type>(value);
});
}
void unset_compaction_manager(http_context& ctx, routes& r) {
cm::get_compactions.unset(r);
cm::get_pending_tasks_by_table.unset(r);
cm::force_user_defined_compaction.unset(r);
cm::stop_compaction.unset(r);
cm::stop_keyspace_compaction.unset(r);
cm::get_pending_tasks.unset(r);
cm::get_completed_tasks.unset(r);
cm::get_total_compactions_completed.unset(r);
cm::get_bytes_compacted.unset(r);
cm::get_compaction_history.unset(r);
cm::get_compaction_info.unset(r);
ss::get_compaction_throughput_mb_per_sec.unset(r);
}
}

View File

@@ -3,17 +3,21 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/sharded.hh>
namespace seastar::httpd {
class routes;
}
class compaction_manager;
namespace api {
struct http_context;
void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r);
void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<compaction_manager>& cm);
void unset_compaction_manager(http_context& ctx, seastar::httpd::routes& r);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "api/api.hh"
@@ -14,6 +14,7 @@
#include "replica/database.hh"
#include "db/config.hh"
#include <sstream>
#include <fmt/ranges.h>
#include <boost/algorithm/string/replace.hpp>
#include <seastar/http/exception.hh>
@@ -83,7 +84,7 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc
namespace cs = httpd::config_json;
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg, bool first) {
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, db::config& cfg, bool first) {
rb->register_function(r, [&cfg, first] (output_stream<char>& os) {
return do_with(first, [&os, &cfg] (bool& first) {
auto f = make_ready_future();
@@ -193,6 +194,17 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
return cfg.saved_caches_directory();
});
ss::set_compaction_throughput_mb_per_sec.set(r, [&cfg](std::unique_ptr<http::request> req) mutable {
api::req_param<uint32_t> value(*req, "value", 0);
cfg.compaction_throughput_mb_per_sec(value.value, utils::config_file::config_source::API);
return make_ready_future<json::json_return_type>(json::json_void());
});
ss::set_stream_throughput_mb_per_sec.set(r, [&cfg](std::unique_ptr<http::request> req) mutable {
api::req_param<uint32_t> value(*req, "value", 0);
cfg.stream_io_throughput_mb_per_sec(value.value, utils::config_file::config_source::API);
return make_ready_future<json::json_return_type>(json::json_void());
});
}
void unset_config(http_context& ctx, routes& r) {
@@ -213,6 +225,8 @@ void unset_config(http_context& ctx, routes& r) {
sp::set_truncate_rpc_timeout.unset(r);
ss::get_all_data_file_locations.unset(r);
ss::get_saved_caches_location.unset(r);
ss::set_compaction_throughput_mb_per_sec.unset(r);
ss::set_stream_throughput_mb_per_sec.unset(r);
}
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
@@ -13,6 +13,6 @@
namespace api {
void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, const db::config& cfg, bool first = false);
void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, db::config& cfg, bool first = false);
void unset_config(http_context& ctx, httpd::routes& r);
}

View File

@@ -3,13 +3,14 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "build_mode.hh"
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>
#include <boost/range/algorithm/transform.hpp>
#include "api/api-doc/cql_server_test.json.hh"
#include "cql_server_test.hh"
@@ -27,21 +28,24 @@ struct connection_sl_params : public json::json_base {
json::json_element<sstring> _role_name;
json::json_element<sstring> _workload_type;
json::json_element<sstring> _timeout;
json::json_element<sstring> _scheduling_group;
connection_sl_params(const sstring& role_name, const sstring& workload_type, const sstring& timeout) {
connection_sl_params(const sstring& role_name, const sstring& workload_type, const sstring& timeout, const sstring& scheduling_group) {
_role_name = role_name;
_workload_type = workload_type;
_timeout = timeout;
_scheduling_group = scheduling_group;
register_params();
}
connection_sl_params(const connection_sl_params& params)
: connection_sl_params(params._role_name(), params._workload_type(), params._timeout()) {}
: connection_sl_params(params._role_name(), params._workload_type(), params._timeout(), params._scheduling_group()) {}
void register_params() {
add(&_role_name, "role_name");
add(&_workload_type, "workload_type");
add(&_timeout, "timeout");
add(&_scheduling_group, "scheduling_group");
}
};
@@ -50,12 +54,13 @@ void set_cql_server_test(http_context& ctx, seastar::httpd::routes& r, cql_trans
auto sl_params = co_await ctl.get_connections_service_level_params();
std::vector<connection_sl_params> result;
boost::transform(std::move(sl_params), std::back_inserter(result), [] (const cql_transport::connection_service_level_params& params) {
std::ranges::transform(std::move(sl_params), std::back_inserter(result), [] (const cql_transport::connection_service_level_params& params) {
auto nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(params.timeout_config.read_timeout).count();
return connection_sl_params(
std::move(params.role_name),
sstring(qos::service_level_options::to_string(params.workload_type)),
to_string(cql_duration(months_counter{0}, days_counter{0}, nanoseconds_counter{nanos})));
to_string(cql_duration(months_counter{0}, days_counter{0}, nanoseconds_counter{nanos})),
std::move(params.scheduling_group_name));
});
co_return result;
});

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "locator/snitch_base.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "api/api-doc/error_injection.json.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "failure_detector.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/core/coroutine.hh>

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <vector>
@@ -14,6 +14,7 @@
#include "gms/inet_address.hh"
#include "service/storage_proxy.hh"
#include "gms/gossiper.hh"
namespace api {
@@ -21,18 +22,18 @@ using namespace json;
using namespace seastar::httpd;
namespace hh = httpd::hinted_handoff_json;
void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy) {
hh::create_hints_sync_point.set(r, [&proxy] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto parse_hosts_list = [] (sstring arg) {
void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy, sharded<gms::gossiper>& g) {
hh::create_hints_sync_point.set(r, [&proxy, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto parse_hosts_list = [&g] (sstring arg) {
std::vector<sstring> hosts_str = split(arg, ",");
std::vector<gms::inet_address> hosts;
std::vector<locator::host_id> hosts;
hosts.reserve(hosts_str.size());
for (const auto& host_str : hosts_str) {
try {
gms::inet_address host;
host = gms::inet_address(host_str);
hosts.push_back(host);
hosts.push_back(g.local().get_host_id(host));
} catch (std::exception& e) {
throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));
}
@@ -41,7 +42,7 @@ void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_p
return hosts;
};
std::vector<gms::inet_address> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));
std::vector<locator::host_id> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));
return proxy.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {
return json::json_return_type(sync_point.encode());
});

View File

@@ -3,19 +3,20 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/sharded.hh>
#include "api/api_init.hh"
#include "gms/gossiper.hh"
namespace service { class storage_proxy; }
namespace api {
void set_hinted_handoff(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& p);
void set_hinted_handoff(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& p, sharded<gms::gossiper>& g);
void unset_hinted_handoff(http_context& ctx, httpd::routes& r);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "api/api-doc/lsa.json.hh"
@@ -11,7 +11,7 @@
#include <seastar/http/exception.hh>
#include "utils/logalloc.hh"
#include "log.hh"
#include "utils/log.hh"
namespace api {
using namespace seastar::httpd;

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "messaging_service.hh"
@@ -114,7 +114,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
}));
get_version.set(r, [&ms](const_req req) {
return ms.local().get_raw_version(gms::inet_address(req.get_query_param("addr")));
return ms.local().current_version;
});
get_dropped_messages_by_ver.set(r, [&ms](std::unique_ptr<request> req) {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/core/coroutine.hh>
@@ -11,7 +11,7 @@
#include "api/api-doc/raft.json.hh"
#include "service/raft/raft_group_registry.hh"
#include "log.hh"
#include "utils/log.hh"
using namespace seastar::httpd;
@@ -102,8 +102,8 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
if (!req->query_parameters.contains("group_id")) {
// Read barrier on group 0 by default
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) {
return raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) -> future<> {
co_await raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
});
co_return json_void{};
}
@@ -111,18 +111,52 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
raft::group_id gid{utils::UUID{req->get_query_param("group_id")}};
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) {
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
if (!raft_gr.find_server(gid)) {
return make_ready_future<>();
co_return;
}
found_srv = true;
return raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
co_await raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
});
if (!found_srv) {
throw bad_param_exception{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_void{};
});
r::trigger_stepdown.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
auto timeout = get_request_timeout(*req);
auto dur = timeout.value ? *timeout.value - lowres_clock::now() : std::chrono::seconds(60);
const auto stepdown_timeout_ticks = dur / service::raft_tick_interval;
auto timeout_dur = raft::logical_clock::duration(stepdown_timeout_ticks);
if (!req->query_parameters.contains("group_id")) {
// Stepdown on group 0 by default
co_await raft_gr.invoke_on(0, [timeout_dur] (service::raft_group_registry& raft_gr) {
apilog.info("Triggering stepdown for group0");
return raft_gr.group0().stepdown(timeout_dur);
});
co_return json_void{};
}
raft::group_id gid{utils::UUID{req->get_path_param("group_id")}};
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout_dur, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
auto* srv = raft_gr.find_server(gid);
if (!srv) {
co_return;
}
found_srv = true;
apilog.info("Triggering stepdown for group {}", gid);
co_await srv->stepdown(timeout_dur);
});
if (!found_srv) {
throw std::runtime_error{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_void{};
});
}
@@ -131,6 +165,7 @@ void unset_raft(http_context&, httpd::routes& r) {
r::trigger_snapshot.unset(r);
r::get_leader_host.unset(r);
r::read_barrier.unset(r);
r::trigger_stepdown.unset(r);
}
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once

63
api/service_levels.cc Normal file
View File

@@ -0,0 +1,63 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "service_levels.hh"
#include "api/api-doc/service_levels.json.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "seastar/json/json_elements.hh"
#include "transport/controller.hh"
#include <unordered_map>
namespace api {
namespace sl = httpd::service_levels_json;
using namespace json;
using namespace seastar::httpd;
void set_service_levels(http_context& ctx, routes& r, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp) {
sl::do_switch_tenants.set(r, [&ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await ctl.update_connections_scheduling_group();
co_return json_void();
});
sl::count_connections.set(r, [&qp] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto connections = co_await qp.local().execute_internal(
"SELECT username, scheduling_group FROM system.clients WHERE client_type='cql' ALLOW FILTERING",
db::consistency_level::LOCAL_ONE,
cql3::query_processor::cache_internal::no
);
using connections_per_user = std::unordered_map<sstring, uint64_t>;
using connections_per_scheduling_group = std::unordered_map<sstring, connections_per_user>;
connections_per_scheduling_group result;
for (auto it = connections->begin(); it != connections->end(); it++) {
auto user = it->get_as<sstring>("username");
auto shg = it->get_as<sstring>("scheduling_group");
if (result.contains(shg)) {
result[shg][user]++;
}
else {
result[shg] = {{user, 1}};
}
}
co_return result;
});
}
}

17
api/service_levels.hh Normal file
View File

@@ -0,0 +1,17 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "api/api_init.hh"
namespace api {
void set_service_levels(http_context& ctx, httpd::routes& r, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "storage_proxy.hh"

Some files were not shown because too many files have changed in this diff Show More