46494 Commits

Author SHA1 Message Date
Botond Dénes
7943db9844 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only be active for the table specified in
the "table_name" injection parameter.

(cherry picked from commit 6c1f6427b3)
2025-04-10 06:52:18 -04:00
Botond Dénes
bd8c584a01 utils/error_injection: add a way to set parameters from error injection points
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceeding.

(cherry picked from commit f7938e3f8b)
2025-04-10 06:52:18 -04:00
Botond Dénes
50c05abd14 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.

(cherry picked from commit 34b18d7ef4)
2025-04-10 06:52:18 -04:00
Botond Dénes
de1d8372fa test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the returned hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
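
The ordering guarantee described above can be sketched as follows. This is only an illustration: `Node`, `Host` and `sort_hosts_like_nodes` are hypothetical stand-ins, not the actual test/pylib code, and matching by IP address is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an entry of the passed-in nodes list."""
    ip: str


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for a CQL driver host object."""
    address: str


def sort_hosts_like_nodes(nodes, hosts):
    """Reorder hosts so that hosts[i] refers to the same instance as nodes[i]."""
    by_address = {host.address: host for host in hosts}
    return [by_address[node.ip] for node in nodes]


nodes = [Node("10.0.0.1"), Node("10.0.0.2"), Node("10.0.0.3")]
unordered = [Host("10.0.0.3"), Host("10.0.0.1"), Host("10.0.0.2")]
ordered = sort_hosts_like_nodes(nodes, unordered)
```

With this, a test can use the same index for the node and for its host.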

(cherry picked from commit e5afd9b5fb)
2025-04-10 03:17:27 -04:00
Botond Dénes
dcc3604e02 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
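
A minimal sketch of the guard, assuming a simplified cell model. `Cell` and `dump_cell` are hypothetical names used for illustration and do not mirror the real dumper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """Hypothetical simplified atomic cell: dead cells carry no value."""
    live: bool
    value: Optional[bytes] = None

    def is_live(self) -> bool:
        return self.live


def dump_cell(cell: Cell) -> dict:
    """Only touch the value when the cell is live."""
    out = {"is_live": cell.is_live()}
    if cell.is_live():
        out["value"] = cell.value
    return out
```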

(cherry picked from commit df09b3f970)
2025-04-10 03:17:27 -04:00
Botond Dénes
39ca3463b3 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.

(cherry picked from commit cb76cafb60)
2025-04-10 03:17:27 -04:00
Botond Dénes
1c7a6ba140 replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable which still has active reads against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
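
The extended check can be sketched roughly as below. This is a simplified model under stated assumptions: each memtable is reduced to a per-key minimum write timestamp, and a tombstone is treated as purgeable only if it is older than every overlapping write. `Memtable`, `max_purgeable` and `can_purge_tombstone` are hypothetical names, not the real API.

```python
from dataclasses import dataclass, field


@dataclass
class Memtable:
    """Hypothetical model: smallest write timestamp per partition key."""
    min_timestamp_by_key: dict = field(default_factory=dict)


def max_purgeable(key, active, flushing, flushed_with_readers):
    """Smallest timestamp for `key` over all memtables that may still hold it.

    Returns None when no memtable overlaps, meaning there is no restriction.
    """
    candidates = [
        m.min_timestamp_by_key[key]
        for m in [active, *flushing, *flushed_with_readers]
        if key in m.min_timestamp_by_key
    ]
    return min(candidates) if candidates else None


def can_purge_tombstone(tombstone_timestamp, limit):
    """A tombstone may be purged only if it is older than the limit."""
    return limit is None or tombstone_timestamp < limit
```

Consulting only the active memtable corresponds to passing empty lists for the other two arguments, which is the case that allowed the resurrection.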

(cherry picked from commit d126ea09ba)
2025-04-10 03:17:27 -04:00
Botond Dénes
4febf2a938 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.

(cherry picked from commit 7e600a0747)
2025-04-10 03:17:27 -04:00
Botond Dénes
b43d024ffb db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.

(cherry picked from commit 6b5b563ef7)
2025-04-10 03:17:27 -04:00
Botond Dénes
4bb1969a7f mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
the additional copy.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start. This leaves the key reference passed to the compactor
corrupt (it refers to a moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstones being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.

(cherry picked from commit c2518cdf1a)
2025-04-10 03:17:27 -04:00
Nadav Har'El
c94d8e2471 Merge '[Backport 2025.1] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23524

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-09 14:59:13 +03:00
Kefu Chai
d7265a1bc2 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.
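
A minimal sketch of the guard, assuming a missing timestamp is represented by the minimum int64 value (as the sanitizer report below suggests). The function name, parameters and the use of an absolute difference are hypothetical; Python integers do not overflow, so the range is only noted in comments.

```python
INT64_MIN = -(2**63)  # assumed marker for a missing last_modified timestamp


def within_write_timeout(read_timestamp: int, last_modified: int,
                         write_timeout: int) -> bool:
    """Decide whether the optimization applies, without risking overflow."""
    if read_timestamp < 0 or last_modified < 0:
        # Missing or negative timestamp: skip the optimization instead of
        # computing a difference that would not fit into int64.
        return False
    # Both operands are non-negative here, so the difference fits into int64.
    return abs(read_timestamp - last_modified) <= write_timeout
```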

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359

(cherry picked from commit ebf9125728)

Closes scylladb/scylladb#23387
2025-04-09 14:56:10 +03:00
Nadav Har'El
7f19a27f4f Merge '[Backport 2025.1] main: safely check stop_signal in-between starting services' from Scylladb[bot]
To simplify aborting scylla while starting the services,
add a _ready state to stop_signal, so that until
main is ready to be stopped by the abort_source,
just register that the signal is caught, and
let a check() method poll that and request abort
and throw respective exception only then, in controlled
points that are in-between starting of services
after the service started successfully and a deferred
stop action was installed.
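
The checkpoint scheme can be sketched as follows. It is a simplified, single-threaded model: `StopSignal`, `AbortRequested` and `start_services` are hypothetical names, and the deferred stop actions are only hinted at in comments.

```python
class AbortRequested(Exception):
    """Raised at a checkpoint once a stop signal has been caught."""


class StopSignal:
    def __init__(self):
        self._caught = False
        self._ready = False
        self.abort_requested = False

    def on_signal(self):
        # Before main is ready, only record that the signal arrived.
        self._caught = True
        if self._ready:
            self.abort_requested = True

    def check(self):
        # Called only at controlled points between starting services.
        if self._caught:
            self.abort_requested = True
            raise AbortRequested("stop signal caught during startup")

    def ready(self):
        self._ready = True


def start_services(signal, names, signal_during=None):
    """Start services in order, checking the signal after each one."""
    started = []
    try:
        for name in names:
            started.append(name)  # service started, deferred stop installed
            if name == signal_during:
                signal.on_signal()  # simulate a signal arriving mid-startup
            signal.check()
        signal.ready()
        return started, False
    except AbortRequested:
        return started, True  # deferred stop actions would run here
```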

This patch prevents gate_closed_exception from escaping handling
when start-up is aborted early with the stop signal,
causing https://github.com/scylladb/scylladb/issues/23153
The regression is apparently due to a25c3eaa1c

Fixes https://github.com/scylladb/scylladb/issues/23153

* Requires backport to 2025.1 due to a25c3eaa1c

- (cherry picked from commit 23433f593c)

- (cherry picked from commit 282ff344db)

- (cherry picked from commit feef7d3fa1)

- (cherry picked from commit b6705ad48b)

Parent PR: #23103

Closes scylladb/scylladb#23184

* github.com:scylladb/scylladb:
  main: add checkpoints
  main: safely check stop_signal in-between starting services
  main: move prometheus start message
  main: move per-shard database start message
2025-04-09 14:54:19 +03:00
Nadav Har'El
c6825920a6 alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentation says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.
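
The enforcement amounts to a range check like the sketch below. The names are hypothetical, and both the lower bound of 1 and the fallback to the maximum when Limit is absent are assumptions made for the example, not taken from this patch.

```python
MAX_GET_RECORDS_LIMIT = 1000


class ValidationException(Exception):
    """Hypothetical stand-in for the error returned to the client."""


def validate_get_records_limit(limit=None):
    """Return the effective Limit, rejecting values outside 1..1000."""
    if limit is None:
        return MAX_GET_RECORDS_LIMIT  # assumed default when Limit is absent
    if not 1 <= limit <= MAX_GET_RECORDS_LIMIT:
        raise ValidationException(
            f"Limit must be between 1 and {MAX_GET_RECORDS_LIMIT}, got {limit}")
    return limit
```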

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547

(cherry picked from commit 84fd52315f)

Closes scylladb/scylladb#23643
2025-04-09 12:46:30 +03:00
Botond Dénes
bff75aa812 Merge '[Backport 2025.1] Add tablet enforcing option' from Scylladb[bot]
This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either case, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.
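
The three modes can be summarized by the decision function sketched below. `use_tablets` is a hypothetical helper written for illustration; `keyspace_enabled` models the `tablets={'enabled':...}` keyspace option, with `None` meaning the option was not given.

```python
from typing import Optional


def use_tablets(mode: str, keyspace_enabled: Optional[bool] = None) -> bool:
    """Decide whether a new keyspace uses tablets under the given mode."""
    if mode == "disabled":
        return keyspace_enabled is True       # vnodes unless opted in
    if mode == "enabled":
        return keyspace_enabled is not False  # tablets unless opted out
    if mode == "enforced":
        if keyspace_enabled is False:
            raise ValueError("tablets cannot be disabled in enforced mode")
        return True
    raise ValueError(f"unknown tablets mode: {mode}")
```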

Fixes scylladb/scylla-enterprise#4355

[Edit: changed `Refs` above to `Fixes` to appease the backport bot gods]

* Requires backport to 2025.1

- (cherry picked from commit c62865df90)

- (cherry picked from commit 62aeba759b)

- (cherry picked from commit 9fac0045d1)

Parent PR: #22273

Closes scylladb/scylladb#23602

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-09 08:47:10 +03:00
Michał Chojnowski
2a74426084 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.
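
The corrected ordering can be sketched as below, using a plain thread lock as a stand-in for the sstable deletion lock. `Table`, `delete_sstable` and `take_snapshot` are hypothetical and greatly simplified.

```python
import threading


class Table:
    """Hypothetical model of a table with an sstable deletion lock."""

    def __init__(self, sstables):
        self._sstables = set(sstables)
        self._deletion_lock = threading.Lock()

    def delete_sstable(self, name):
        with self._deletion_lock:
            self._sstables.discard(name)

    def take_snapshot(self):
        # Take the lock first, then obtain the set and iterate over it,
        # so no sstable can be unlinked between the two steps.
        with self._deletion_lock:
            current = sorted(self._sstables)
            return [f"link:{name}" for name in current]
```

The racy variant would copy `self._sstables` before entering the `with` block, leaving a window in which a concurrent deletion goes unnoticed.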

Closes scylladb/scylladb#23397

(cherry picked from commit e23fdc0799)

Closes scylladb/scylladb#23628
2025-04-08 19:07:22 +03:00
Lakshmi Narayanan Sreethar
b7e72b3167 replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.
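
The gate semantics described above can be modeled roughly like this. `Gate`, `do_apply_with_check` and `do_apply_without_check` are hypothetical simplifications; a real gate's close would also wait for existing holders to leave.

```python
class GateClosed(Exception):
    pass


class Gate:
    def __init__(self):
        self._closed = False
        self.holders = 0

    def enter(self):
        if self._closed:
            raise GateClosed("gate is closed to new operations")
        self.holders += 1

    def leave(self):
        self.holders -= 1

    def close(self):
        # New entries are refused; current holders may still finish.
        self._closed = True

    def is_closed(self):
        return self._closed


def do_apply_with_check(gate, log):
    """Old behavior: re-checks the gate although the caller already holds it."""
    if gate.is_closed():
        raise GateClosed("compaction group is closing")
    log.append("applied")


def do_apply_without_check(gate, log):
    """New behavior: relies on the caller having entered the gate."""
    log.append("applied")
```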

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579

(cherry picked from commit 750f4baf44)

Closes scylladb/scylladb#23645
2025-04-08 18:59:22 +03:00
Yaron Kaikov
98359dbfb1 .github: Make "make-pr-ready-for-review" workflow run in base repo
In 57683c1a50 we fixed the `token` error,
but removed the checkout part, which now causes the following error:
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Add the repo checkout stage to avoid this error.

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641

(cherry picked from commit 2dc7ea366b)

Closes scylladb/scylladb#23654
2025-04-08 13:49:27 +03:00
Benny Halevy
27ca0d1812 boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9fac0045d1)
2025-04-08 08:35:26 +03:00
Benny Halevy
736f89b31a tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 62aeba759b)
2025-04-08 08:35:14 +03:00
Benny Halevy
a49e27ac8f db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
which will enable tablets by default for new keyspaces but
without the possibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c62865df90)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-08 08:08:47 +03:00
Botond Dénes
1a896169dc Merge '[Backport 2025.1] repair: release erm in repair_writer_impl::create_writer when possible' from Scylladb[bot]
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

- (cherry picked from commit 1dc29ddc86)

- (cherry picked from commit bae6711809)

Parent PR: #23455

Closes scylladb/scylladb#23580

* github.com:scylladb/scylladb:
  test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
scylla-2025.1.1-candidate-20250408065609 scylla-2025.1.1
2025-04-07 10:10:20 +03:00
Kefu Chai
9ccad33e59 .github: Make "make-pr-ready-for-review" workflow run in base repo
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error.  This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
    Error: Input required and not supplied: token
```

This commit addresses the issue by:

- Running the workflow in the base repository instead of the fork. This
  grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
  control execution, as recommended in the GitHub Actions documentation
  (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
  This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.

This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.

Fixes scylladb/scylladb#22765

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22766

(cherry picked from commit ca832dc4fb)

Closes scylladb/scylladb#23561
2025-04-07 08:10:10 +03:00
Piotr Smaron
a17dd4d4c9 [Backport 2025.1] auth: forbid modifying system ks by non-superusers
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.
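
The resulting rule can be sketched as a small permission check. The function name and the keyspace set are hypothetical; only `system` is listed, as an illustrative subset rather than the full list of system keyspaces.

```python
SYSTEM_KEYSPACES = frozenset({"system"})  # illustrative subset only


def can_modify(keyspace: str, is_superuser: bool,
               has_modify_on_all_keyspaces: bool) -> bool:
    """MODIFY on ALL KEYSPACES no longer reaches system keyspaces."""
    if is_superuser:
        return True
    if keyspace in SYSTEM_KEYSPACES:
        return False
    return has_modify_on_all_keyspaces
```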

Fixes: scylladb/scylladb#23218
(cherry picked from commit fee50f287c)

Parent PR: #23219

Closes scylladb/scylladb#23594
2025-04-06 15:10:06 +03:00
Nadav Har'El
a2a4c6e4b2 test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.

So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578

(cherry picked from commit a9a6f9eecc)

Closes scylladb/scylladb#23601
2025-04-06 11:49:46 +03:00
Avi Kivity
64182d9df6 Update seastar submodule (prefaulter leaving zombie threads)
* seastar a350b5d70e...6d8fccf14c (1):
  > smp: prefaulter: don't leave zombie worker threads

Fixes #23316
2025-04-05 22:28:53 +03:00
Pavel Emelyanov
8e85ef90d2 sstables_loader: Do not stop sharded<progress_monitor> unconditionally
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
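
The fix boils down to stopping only what was started, as in the sketch below. `ProgressMonitor` and `LoadTask` are hypothetical models; here stop() on a never-started monitor raises, to stand in for the failure mode.

```python
class ProgressMonitor:
    def __init__(self):
        self.started = False
        self.stopped = False

    def start(self):
        self.started = True

    def stop(self):
        if not self.started:
            raise RuntimeError("stop() called on a monitor that never started")
        self.stopped = True


class LoadTask:
    def __init__(self):
        self.monitor = ProgressMonitor()

    def run(self, abort_early=False):
        if abort_early:
            # Models an early abort or a failure while collecting sstables.
            raise RuntimeError("task aborted before the monitor was started")
        self.monitor.start()

    def release_resources(self):
        # Stop the monitor only if run() got far enough to start it.
        if self.monitor.started:
            self.monitor.stop()
```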

fixes: #23231

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23483

(cherry picked from commit 832d83ae4b)

Closes scylladb/scylladb#23557
2025-04-04 17:46:20 +03:00
Aleksandra Martyniuk
b5b2ffa5df test: add test to check concurrent migration and repair of two different tablets
(cherry picked from commit bae6711809)
2025-04-04 10:14:51 +02:00
Andrzej Jackowski
b7f067ce33 audit: fix empty query string in BATCH query
Function modification_statement::add_raw() is never called, which
makes query string in audit_info of batch queries empty. In enterprise
branch, add_raw is called in Cql.g and those changes were never merged
to master.

This changes:
 - Add missing call of add_raw() to Cql.g
 - Include other related changes (from PR#3228 in scylla-enterprise)

Fixes scylladb#23311

Closes scylladb/scylladb#23315

(cherry picked from commit b8adbcbc84)

Closes scylladb/scylladb#23495
2025-04-03 16:46:33 +03:00
Aleksandra Martyniuk
307f00a398 repair: release erm in repair_writer_impl::create_writer when possible
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.

Fixes: #23453.
(cherry picked from commit 1dc29ddc86)
2025-04-03 13:19:40 +00:00
Dawid Mędrek
c56e47f72f db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811

(cherry picked from commit 0a6137218a)

Closes scylladb/scylladb#23370
2025-04-03 09:09:05 +02:00
Tomasz Grabiec
51ee15f02d Merge '[Backport 2025.1] tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
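
As a rough illustration of the difference, the sketch below compares shards by storage utilization instead of tablet count, assuming every tablet has the same average size. `utilization` and `pick_move` are hypothetical helpers, not the load balancer's real interface.

```python
def utilization(tablet_count, avg_tablet_size, capacity):
    """Fraction of a shard's capacity used, assuming equal-sized tablets."""
    return tablet_count * avg_tablet_size / capacity


def pick_move(shards, avg_tablet_size):
    """Return (source, destination) shard names, or None if balanced.

    `shards` maps a shard name to a (tablet_count, capacity) pair.
    """
    load = {
        name: utilization(count, avg_tablet_size, capacity)
        for name, (count, capacity) in shards.items()
    }
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    return (src, dst) if load[src] > load[dst] else None
```

Two shards with the same tablet count but different capacity are balanced by count, yet the smaller one is the more utilized, so a tablet would move from it to the larger one.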

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Fixes #23042

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config

Closes scylladb/scylladb#23443

* github.com:scylladb/scylladb:
  Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
scylla-2025.1.1-candidate-20250402022041
2025-04-01 20:31:05 +02:00
Vlad Zolotarov
feadb781f2 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.
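
The idea can be sketched as below. `TraceState` and `set_common_query_parameters` are hypothetical names chosen for the example and do not reproduce the actual tracing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceState:
    """Hypothetical tracing record for one CQL request."""
    consistency_level: Optional[str] = None
    serial_consistency_level: Optional[str] = None
    default_timestamp: Optional[int] = None


def set_common_query_parameters(trace, consistency_level,
                                serial_consistency_level, default_timestamp):
    """Record every parameter shared by QUERY, EXECUTE and BATCH in one place."""
    trace.consistency_level = consistency_level
    trace.serial_consistency_level = serial_consistency_level
    trace.default_timestamp = default_timestamp
    return trace


# Each opcode handler makes the same single call, so none can forget a field.
traces = {
    opcode: set_common_query_parameters(TraceState(), "QUORUM", "SERIAL", 1234)
    for opcode in ("QUERY", "EXECUTE", "BATCH")
}
```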

(cherry picked from commit f7e1695068)
2025-04-01 11:45:54 +00:00
Vlad Zolotarov
6b71d6b9ba transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173

(cherry picked from commit ca6bddef35)
2025-04-01 11:45:54 +00:00
Jenkins Promoter
36bb089663 Update ScyllaDB version to: 2025.1.1
2025-04-01 14:18:18 +03:00
yangpeiyu2_yewu
1661b35050 mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.

Fixes scylladb/scylladb#22790

Closes scylladb/scylladb#22812

(cherry picked from commit 0de232934a)

Closes scylladb/scylladb#22943
2025-04-01 13:46:27 +03:00
Asias He
8c93a331f7 repair: Enable small table optimization for system_replicated_keys
This enterprise-only system table is replicated and small. It should be
included for small table optimization.

Fixes scylladb/scylla-enterprise#5256

Closes scylladb/scylladb#23135

Closes scylladb/scylladb#23147
2025-04-01 13:36:51 +03:00
Calle Wilund
85c161b9f1 generic_server: Update conditions for is_broken_pipe_or_connection_reset
Refs scylla-enterprise#5185
Fixes #22901

If a tls socket gets EPIPE the error is not translated to a specific
gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat
EPIPE as ignorable for plain sockets, we need to unwind nested exception
here to detect that the error was in fact due to this, so we can suppress
log output for this.
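
Unwinding a nested exception to find the underlying cause can be sketched as follows. This is a Python analogy using exception chaining, not the C++ implementation; the function name is reused here only for illustration.

```python
import errno


def is_broken_pipe_or_connection_reset(exc):
    """Walk the exception chain looking for EPIPE or ECONNRESET."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cycles in the chain
        if isinstance(exc, OSError) and exc.errno in (errno.EPIPE,
                                                      errno.ECONNRESET):
            return True
        exc = exc.__cause__ or exc.__context__
    return False


# A generic TLS-level error wrapping the original EPIPE.
try:
    try:
        raise OSError(errno.EPIPE, "broken pipe")
    except OSError as inner:
        raise RuntimeError("generic TLS push error") from inner
except RuntimeError as outer:
    wrapped = outer
```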

Closes scylladb/scylladb#22888

(cherry picked from commit e49f2046e5)

Closes scylladb/scylladb#23045
2025-04-01 13:06:29 +03:00
Patryk Jędrzejczak
d088cc8a2d Merge '[Backport 2025.1] Fix a regression that sometimes causes an internal error and demote barrier_and_drain rpc error log to a warning ' from Scylladb[bot]
The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation.

We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness.

- (cherry picked from commit 1da7d6bf02)

- (cherry picked from commit fe45ea505b)

Parent PR: #22650

Closes scylladb/scylladb#22923

* https://github.com/scylladb/scylladb:
  topology_coordinator: demote barrier_and_drain rpc failure to warning
  topology_coordinator: read peers table only once during topology state application
2025-04-01 11:54:56 +02:00
Patryk Jędrzejczak
39c20144e5 Merge '[Backport 2025.1] raft topology: Add support for raft topology init to happen before group0 initialization' from Scylladb[bot]
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.

This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug is
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in a bug-free
manner.

Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also performs
a refactoring to populate all init data to system tables even before
group0 creation(via `raft_initialize_discovery_leader` function). Now
because `raft_initialize_discovery_leader` happens before the group 0
creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.

Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.

By moving the `raft_initialize_discovery_leader` logic to happen before
starting the group0 server, and not as a group0 command after server
start, we also get rid of the potential problem of the init group0
command not being the first command on the server. This ensures the
ordering the programmer expects.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#21114

- (cherry picked from commit 4748125a48)

- (cherry picked from commit e491950c47)

- (cherry picked from commit 623e01344b)

- (cherry picked from commit d7884cf651)

Parent PR: #22484

Closes scylladb/scylladb#22966

* https://github.com/scylladb/scylladb:
  storage_service: Remove the variable _manage_topology_change_kind_from_group0
  storage_service: fix indentation after the previous commit
  raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
  service/raft: Refactor mutation writing helper functions.
2025-04-01 11:46:15 +02:00
Jenkins Promoter
f1e7cee7a5 Update pgo profiles - aarch64 2025-04-01 04:20:56 +03:00
Jenkins Promoter
023b27312d Update pgo profiles - x86_64 2025-04-01 04:08:00 +03:00
Anna Stuchlik
2ffbc81e19 doc: remove the outdated info on seeds-info
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.

In addition, some clarification has been added to the existing sections.

Fixes https://github.com/scylladb/scylladb/issues/22400

Closes scylladb/scylladb#23282

(cherry picked from commit dbbf9e19e4)

Closes scylladb/scylladb#23327
2025-03-31 12:33:59 +02:00
Yaron Kaikov
88e548ed72 .github: add action to make PR ready for review when conflicts label was removed
Moving a PR out of draft is only allowed for users with write access,
so add a GitHub action that switches the PR to `ready for review` once
the `conflicts` label is removed.

Closes scylladb/scylladb#22446

(cherry picked from commit ed4bfad5c3)

Closes scylladb/scylladb#23023
2025-03-31 13:22:04 +03:00
Tomasz Grabiec
975882a489 test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's because a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451

(cherry picked from commit 8e506c5a8f)

Closes scylladb/scylladb#23469
2025-03-28 14:56:02 +01:00
Evgeniy Naydanov
3653662099 test.py: random_failures: deselect topology ops for some injections
After recent changes, #18640 and #19151 started to reproduce for the
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.

The solution is the same: deselect the tests.

Fixes #23302

Closes scylladb/scylladb#23405

(cherry picked from commit 574c81eac6)

Closes scylladb/scylladb#23460
2025-03-27 13:19:59 +02:00
Anna Stuchlik
7336bb38fa doc: fix product names in the 2025.1 upgrade guides
This commit fixes the product names in the upgrade 2025.1 guides so that:

- 6.2 is preceded with "ScyllaDB Open Source"
- 2024.x is preceded with "ScyllaDB Enterprise"
- 2025.1 is preceded with "ScyllaDB"

Fixes https://github.com/scylladb/scylladb/issues/23154

Closes scylladb/scylladb#23223

(cherry picked from commit cd61f60549)

Closes scylladb/scylladb#23328
2025-03-27 11:58:01 +02:00
Avi Kivity
cff90755d8 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
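
The difference between the two balancing goals described above can be
illustrated with a minimal sketch. This is not the ScyllaDB load
balancer: the `Shard` class, the greedy placement functions, and the
capacity numbers are all hypothetical, and every tablet is assumed to
have the same size, as the commit message states.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    capacity_gib: int      # hypothetical per-shard disk capacity
    tablet_count: int = 0


def utilization(shard: Shard, tablet_size_gib: int) -> float:
    """Fraction of the shard's capacity in use, assuming equal-size tablets."""
    return shard.tablet_count * tablet_size_gib / shard.capacity_gib


def place_by_count(shards: list[Shard], n_tablets: int) -> None:
    """Count-based goal: each tablet goes to the shard with the fewest tablets."""
    for _ in range(n_tablets):
        target = min(shards, key=lambda s: s.tablet_count)
        target.tablet_count += 1


def place_by_utilization(shards: list[Shard], n_tablets: int,
                         tablet_size_gib: int) -> None:
    """Capacity-aware goal: each tablet goes to the shard that would be
    least utilized after receiving it."""
    for _ in range(n_tablets):
        target = min(
            shards,
            key=lambda s: (s.tablet_count + 1) * tablet_size_gib / s.capacity_gib,
        )
        target.tablet_count += 1


if __name__ == "__main__":
    # Two shards with different capacity, e.g. from different instance types.
    by_count = [Shard("small", 100), Shard("large", 200)]
    place_by_count(by_count, 30)

    by_util = [Shard("small", 100), Shard("large", 200)]
    place_by_utilization(by_util, 30, tablet_size_gib=1)

    for label, shards in (("count", by_count), ("utilization", by_util)):
        for s in shards:
            print(label, s.name, s.tablet_count, round(utilization(s, 1), 3))
```

With equal tablet counts the smaller shard ends up twice as full as the
larger one, while the capacity-aware variant gives the larger shard
proportionally more tablets so that both reach the same utilization.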

One consequence is that, to be able to balance, the load balancer needs
to know every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down, due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to the above, so it is not implemented.

Also, the per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below the limit and nodes with more capacity will
be slightly above the limit. This shouldn't be a significant problem in practice; we could
compensate for it by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats

(cherry picked from commit b1d9f80d85)
2025-03-25 23:16:35 +01:00
Tomasz Grabiec
3be469da29 test: tablets_test: Add support for auto-split mode
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. That is inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default; this can be disabled for tests which expect
otherwise.

shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.

(cherry picked from commit 5e471c6f1b)
2025-03-25 18:23:22 +01:00
Tomasz Grabiec
1895724465 test: cql_test_env: Expose db config
(cherry picked from commit f3b63bfeff)
2025-03-25 18:22:32 +01:00