Commit Graph

47837 Commits

Author SHA1 Message Date
Avi Kivity
5e4941a74b Merge '[Backport 2025.2] sstables/mx/writer: handle non-full prefix row keys' from Scylladb[bot]
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

- (cherry picked from commit 92b5fe8983)

- (cherry picked from commit 0753643606)

- (cherry picked from commit b0d5462440)

- (cherry picked from commit 093d4f8d69)

- (cherry picked from commit 678deece88)

- (cherry picked from commit 64f8500367)

- (cherry picked from commit b931145a26)

- (cherry picked from commit 3e1c50e9a7)

- (cherry picked from commit 46ff7f9c12)

- (cherry picked from commit ebd9420687)

- (cherry picked from commit aae212a87c)

- (cherry picked from commit 592ca789e2)

- (cherry picked from commit edc2906892)

Parent PR: #24492

Closes scylladb/scylladb#24744

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-07-01 12:27:01 +03:00
Gleb Natapov
31ed717afb storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651

(cherry picked from commit 5f953eb092)

Closes scylladb/scylladb#24703
2025-07-01 10:15:12 +02:00
Abhinav Jha
160c937efe group0: modify start_operation logic to account for synchronize phase race condition
In the present scenario, the bootstrapping node undergoes synchronize phase after
initialization of group0, then enters post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues stream
ranges command as soon as the node successfully completes `join_group0`. For
a node booting into an already upgraded cluster, the time the node
remains in synchronize phase is negligible, but this race condition still causes trouble in a
small percentage of cases, since the stream ranges operation fails and the node fails to bootstrap.

This commit addresses this issue and updates the error throw logic to account for this
edge case and lets the node wait (with timeouts) for the synchronize phase to finish instead of throwing
an error.

A regression test is also added to confirm the working of this code change. The test adds a
wait in synchronize phase for newly joining node and releases only after the program counter
reaches the synchronize case in the `start_operation` function. Hence it indicates that in the
updated code, the start_operation will wait for the node to get done with the
synchronize phase instead of throwing error.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23536

Closes scylladb/scylladb#23829

(cherry picked from commit 5ff693eff6)

Closes scylladb/scylladb#24628
2025-07-01 10:10:55 +02:00
Jenkins Promoter
0bf8fe4778 Update pgo profiles - aarch64 2025-07-01 04:30:55 +03:00
Jenkins Promoter
a08ff869f3 Update pgo profiles - x86_64 2025-07-01 04:07:23 +03:00
Jenkins Promoter
fe22df0af2 Update ScyllaDB version to: 2025.2.1 2025-06-30 23:59:09 +03:00
Anna Stuchlik
3d8368cacb doc: remove OSS mention from the SI notes
This commit removes a confusing reference to an Open Source version
from the Local Secondary Indexes page.

Fixes https://github.com/scylladb/scylladb/issues/24668

Closes scylladb/scylladb#24673

(cherry picked from commit 2367330513)

Closes scylladb/scylladb#24723
2025-06-30 18:53:48 +03:00
Botond Dénes
236cab0f66 test/boost/sstable_datafile_test: add test for corrupt data
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data

(cherry picked from commit edc2906892)
2025-06-30 12:44:29 +00:00
Botond Dénes
cd97f4c4c3 sstables/mx/writer: handler rows with empty keys
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.

Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.
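
The writer/parser mismatch described above can be illustrated with a toy encoding. This is only a sketch: `write_row`, `parse_row`, and the integer-based layout are hypothetical and do not reflect the real mx sstable format. It assumes a row is encoded as its key components followed by its values, with nothing marking how many key components were written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (not the real mx format): a row is encoded as its key
// components followed by its values, with no length marker for the key.
struct parsed_row {
    std::vector<int> key;
    std::vector<int> values;
};

// Hypothetical writer: emits only the key components that are present.
std::vector<int> write_row(const std::vector<int>& key_prefix,
                           const std::vector<int>& values) {
    std::vector<int> out(key_prefix);
    out.insert(out.end(), values.begin(), values.end());
    return out;
}

// Hypothetical parser: unconditionally reads a full key of full_key_size
// components, then treats the remainder as row content.
parsed_row parse_row(const std::vector<int>& encoded, std::size_t full_key_size) {
    assert(encoded.size() >= full_key_size);
    parsed_row row;
    row.key.assign(encoded.begin(), encoded.begin() + full_key_size);
    row.values.assign(encoded.begin() + full_key_size, encoded.end());
    return row;
}
```

In this model, writing the one-component prefix `{1}` with values `{100, 200}` and parsing with a full key size of 2 produces the key `{1, 100}`: the first value has been consumed as part of the key, which is the kind of garbage key the message describes.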

(cherry picked from commit 592ca789e2)
2025-06-30 12:44:29 +00:00
Botond Dénes
7654ccbef5 test/lib/cql_assertions: introduce columns_assertions
To enable targeted and optionally typed assertions against individual
columns in a row.

(cherry picked from commit aae212a87c)
2025-06-30 12:44:29 +00:00
Botond Dénes
9eb9ffe4bc sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.

(cherry picked from commit ebd9420687)
2025-06-30 12:44:29 +00:00
Botond Dénes
b0a233b2c9 tools/scylla-sstable: make large_data_handler a local
No reason for it to be a global, not even convenience.

(cherry picked from commit 46ff7f9c12)
2025-06-30 12:44:29 +00:00
Botond Dénes
53373ea9b7 db: introduce corrupt_data_handler
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saves corrupt data in
  system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data
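
The delegation pattern described above might look roughly like the following. This is a minimal sketch, not the interface from the commit: every `toy_*` name, the `handle()` signature, and `write_or_delegate` are hypothetical, and a vector stands in for the `system.corrupt_data` table (the TTL is not modeled).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface; the real corrupt_data_handler API may differ.
class toy_corrupt_data_handler {
public:
    virtual ~toy_corrupt_data_handler() = default;
    // Returns true if the corrupt row was stored for later inspection.
    virtual bool handle(std::string corrupt_row) = 0;
};

// Drops corrupt data, like the nop implementation described above.
class toy_nop_handler final : public toy_corrupt_data_handler {
public:
    bool handle(std::string) override { return false; }
};

// Keeps corrupt data; a vector stands in for system.corrupt_data.
class toy_storing_handler final : public toy_corrupt_data_handler {
    std::vector<std::string> _stored;
public:
    bool handle(std::string corrupt_row) override {
        _stored.push_back(std::move(corrupt_row));
        return true;
    }
    const std::vector<std::string>& stored() const { return _stored; }
};

// A writer delegates to the handler instead of writing a row it cannot
// encode safely. Returns true if the row went to the sstable.
bool write_or_delegate(toy_corrupt_data_handler& handler, std::string row,
                       bool key_is_valid) {
    if (key_is_valid) {
        return true; // the row would be written to the sstable here
    }
    handler.handle(std::move(row));
    return false;
}
```

The writer only depends on the abstract interface, so tools and tests can plug in the dropping variant while a server would plug in the storing one.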

(cherry picked from commit 3e1c50e9a7)
2025-06-30 12:44:29 +00:00
Botond Dénes
b952d8a88c mutation: introduce frozen_mutation_fragment_v2
Mirrors frozen_mutation_fragment and shares most of the underlying
serialization code, the only exception is replacing range_tombstone with
range_tombstone_change in the mutation fragment variant.

(cherry picked from commit b931145a26)
2025-06-30 12:44:28 +00:00
Botond Dénes
a561600e7e mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
Instead of mutation_fragment, let caller convert into mutation_fragment.
Allows reuse in future callers which will want to convert to
mutation_fragment_v2.

(cherry picked from commit 64f8500367)
2025-06-30 12:44:28 +00:00
Botond Dénes
45b6cc069f mutation/mutation_partition_view: extract de-ser of {clustering,static} row
From the visitor in frozen_mutation_fragment::unfreeze(). We will want
to re-use it in the future frozen_mutation_fragment_v2::unfreeze().

Code-movement only, the code is not changed.

(cherry picked from commit 678deece88)
2025-06-30 12:44:28 +00:00
Botond Dénes
355a1b4af4 idl-compiler.py: generate skip() definition for enums serializers
Currently they only have the declaration and so far they got away with
it, looks like no users exist, but this is about to change so generate
the definition too.

(cherry picked from commit 093d4f8d69)
2025-06-30 12:44:28 +00:00
Botond Dénes
2ead6a43a5 idl: extract full_position.idl from position_in_partition.idl
A future user of position_in_partition.idl doesn't need full_position
and so doesn't want to include full_position.hh to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user in a
storage_proxy VERB.

(cherry picked from commit b0d5462440)
2025-06-30 12:44:28 +00:00
Botond Dénes
14595c49ae db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.

(cherry picked from commit 0753643606)
2025-06-30 12:44:28 +00:00
Botond Dénes
43eb3bcf91 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and restored if possible.

(cherry picked from commit 92b5fe8983)
2025-06-30 12:44:28 +00:00
Aleksandra Martyniuk
7fd4d77fdd test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24655
2025-06-28 09:39:06 +03:00
Patryk Jędrzejczak
2c89800e76 Merge '[Backport 2025.2] docs: document the new recovery procedure' from Scylladb[bot]
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

The new recovery procedure requires the Raft-based topology to be
enabled, so to remove the old procedure from the documentation,
we must assume users have the Raft-based topology enabled.
We can do it in 2025.2 because the upgrade guides to 2025.1 state that
enabling the Raft-based topology is a mandatory step of the upgrade.
Another reminder is the upgrade guides to 2025.2.

Since we rely on the Raft-based topology being enabled, we remove the
obsolete parts of the documentation.

We will make the Raft-based topology mandatory in the code in the
future, hopefully in 2025.3. For this reason, we also don't touch the
dev docs in this PR.

Fixes scylladb/scylladb#24530

Requires backport to 2025.2 because 2025.2 contains the new recovery
procedure.

- (cherry picked from commit 4e256182a0)

- (cherry picked from commit 203ea5d8f9)

Parent PR: #24583

Closes scylladb/scylladb#24702

* https://github.com/scylladb/scylladb:
  docs: rely on the Raft-based topology being enabled
  docs: handling-node-failures: document the new recovery procedure
2025-06-27 11:58:36 +02:00
Patryk Jędrzejczak
b1bfa4b115 docs: rely on the Raft-based topology being enabled
In 2025.2, we don't force enabling the Raft-based topology in the code,
but we stated in the upgrade guides that it's a mandatory step of the
upgrade to 2025.1. We also remind users to enable the Raft-based
topology in the upgrade guides to 2025.2. Hence, we can rely in
the documentation on the Raft-based topology being enabled. If it is
still disabled, we can just send the user to the upgrade guides. Hence:
- we remove all documentation related to enabling the Raft-based
  topology, enabling the Raft-based schema (enabled Raft-based topology
  implies enabled Raft-based schema), and the gossip-based topology,
- we can replace the documentation of the old manual recovery procedure
  with the documentation of the new manual recovery procedure (done in
  the previous commit).

(cherry picked from commit 203ea5d8f9)
2025-06-26 22:18:56 +00:00
Patryk Jędrzejczak
f052af6c45 docs: handling-node-failures: document the new recovery procedure
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

We can get rid of the old procedure from the documentation because
we requested users to enable the Raft-based topology during upgrades to
2025.1 and 2025.2.

We leave the note that enabling the Raft-based topology is required to
use the new recovery procedure just in case, since we didn't force
enabling the Raft-based topology in the code.

(cherry picked from commit 4e256182a0)
2025-06-26 22:18:56 +00:00
Jenkins Promoter
33e947e753 Update ScyllaDB version to: 2025.2.0 scylla-2025.2.0 scylla-2025.2.0-candidate-20250625081920 2025-06-25 15:29:15 +03:00
Anna Stuchlik
b469158418 doc: improve the tablets limitations section
This PR improves the Limitations and Unsupported Features section
for tablets, as it has been confusing to the customers.

Refs https://github.com/scylladb/scylla-enterprise/issues/5465

Fixes https://github.com/scylladb/scylladb/issues/24562

Closes scylladb/scylladb#24563

(cherry picked from commit 17eabbe712)

Closes scylladb/scylladb#24588
2025-06-24 10:06:21 +03:00
Benny Halevy
afa2b40ac9 disk_space_monitor: add space_source_registration
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.

Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.
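
The RAII idea described above can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `toy_monitor`, its `space_source` member, and `toy_space_source_registration` are hypothetical stand-ins for the monitor and its registration object.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for the disk space monitor.
struct toy_monitor {
    using space_source_fn = std::function<std::uint64_t()>;
    space_source_fn space_source = [] { return std::uint64_t(0); };
};

// Installs a new space source on construction and restores the previous
// one on destruction, so a callback that captures locals of the current
// scope cannot outlive that scope.
class toy_space_source_registration {
    toy_monitor& _monitor;
    toy_monitor::space_source_fn _previous;
public:
    toy_space_source_registration(toy_monitor& m, toy_monitor::space_source_fn fn)
        : _monitor(m)
        , _previous(std::exchange(m.space_source, std::move(fn))) {}

    ~toy_space_source_registration() {
        _monitor.space_source = std::move(_previous);
    }

    // Non-copyable: exactly one object is responsible for the restore.
    toy_space_source_registration(const toy_space_source_registration&) = delete;
    toy_space_source_registration& operator=(const toy_space_source_registration&) = delete;
};
```

A test would create the registration on the stack next to the data its callback captures. When the scope ends, the monitor falls back to the earlier source before that data is destroyed.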

Fixes #24314

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24342

(cherry picked from commit 8b387109fc)

Closes scylladb/scylladb#24392
2025-06-24 10:02:23 +03:00
Raphael S. Carvalho
fa420f8644 replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second to be filtered out. It's not perfect, but it is
extremely unlikely that a new write lands and gets flushed in the same
millisecond we recorded the truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, whose creation
frequency is low enough to not have significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.
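
The granularity problem can be shown with plain `std::chrono`. This is a sketch under assumptions: `is_discard_candidate` is a hypothetical helper, and it assumes an sstable counts as covered by the truncation when its creation time, at the clock's granularity, is not later than `truncated_at`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: compares creation time against truncated_at after
// rounding both down to the given clock granularity.
template <typename Granularity>
bool is_discard_candidate(std::chrono::milliseconds created,
                          std::chrono::milliseconds truncated_at) {
    return std::chrono::duration_cast<Granularity>(created)
        <= std::chrono::duration_cast<Granularity>(truncated_at);
}
```

With `truncated_at` at 10100 ms and an sstable created at 10700 ms, the comparison at `std::chrono::seconds` granularity sees 10 s on both sides and treats the later sstable as covered, while the comparison at `std::chrono::milliseconds` granularity tells them apart.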

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426

(cherry picked from commit 2d716f3ffe)

Closes scylladb/scylladb#24435
2025-06-24 10:02:06 +03:00
Andrzej Jackowski
60bc1c339c test: wait for normal state propagation in test_auth_v2_migration
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.

In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.

To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.

Refs: #24163

Closes scylladb/scylladb#24185

(cherry picked from commit 555d897a15)

Closes scylladb/scylladb#24520
2025-06-24 10:01:42 +03:00
Michał Chojnowski
3eba371e09 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.
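
The lifetime contract can be illustrated with a toy reader that, like the contract implies, keeps a reference to the range it was forwarded to. This is a sketch only: `toy_range`, `toy_reader`, and `read_after_forward` are hypothetical, and the real reader is asynchronous.

```cpp
#include <cassert>

// Hypothetical stand-in for dht::partition_range.
struct toy_range {
    int start;
    int end;
};

// Toy reader that remembers the address of the range it was given,
// which is why the caller must keep the range alive.
class toy_reader {
    const toy_range* _pr = nullptr;
public:
    void fast_forward_to(const toy_range& pr) { _pr = &pr; }
    int current_start() const {
        assert(_pr != nullptr);
        return _pr->start;
    }
};

int read_after_forward() {
    toy_range pr{6, 10}; // named object: stays valid while the reader uses it
    toy_reader reader;
    reader.fast_forward_to(pr);
    // reader.fast_forward_to(toy_range{6, 10}); // temporary: would dangle
    return reader.current_start();
}
```

Passing a named object whose scope encloses every later use of the reader satisfies the contract. The commented-out call shows the shape of the violation: the temporary is destroyed at the end of the statement while the reader still points at it.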

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543

(cherry picked from commit 27f66fb110)

Closes scylladb/scylladb#24548
2025-06-24 10:01:19 +03:00
Gleb Natapov
c644526bf9 api: return error from get_host_id_map if gossiper is not enabled yet.
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.

Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.

Fixes: #24479

Closes scylladb/scylladb#24575

(cherry picked from commit e364995e28)

Closes scylladb/scylladb#24587
2025-06-24 10:00:48 +03:00
Nadav Har'El
34bdbad128 Merge '[Backport 2025.2] cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Scylladb[bot]
cql, schema: Extend name length limit from 48 to 192 bytes

    This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
    The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
    and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
    This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

    The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
    When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
    For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
    The directory name for this log table becomes the longest possible representation.
    Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
    To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
      255 bytes (common filesystem limit for a path component)
    -  32 bytes (for the 32-character UUID string)
    -   1 byte  (for the '-' separator)
    -  15 bytes (for the '_scylla_cdc_log' suffix)
    -  15 bytes (reserved for future use)
    ----------
    = 192 bytes (Maximum allowed name length)
    This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

    This patch also updates/adds all associated tests to validate the new 192-byte limit.
    The documentation has been updated accordingly.

Fixes #4480

Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.

- (cherry picked from commit a41c12cd85)

- (cherry picked from commit 4577c66a04)

Parent PR: #24500

Closes scylladb/scylladb#24603

* github.com:scylladb/scylladb:
  cql, schema: Extend name length limit from 48 to 192 bytes
  replica: Remove unused keyspace::init_storage()
2025-06-23 15:48:23 +03:00
Karol Nowacki
76bd23cddd cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
  255 bytes (common filesystem limit for a path component)
-  32 bytes (for the 32-character UUID string)
-   1 byte  (for the '-' separator)
-  15 bytes (for the '_scylla_cdc_log' suffix)
-  15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
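
The arithmetic above can be checked at compile time. This is a sketch, not the validation code from the patch: the constant names and `is_valid_name_length` are hypothetical, and only the numbers and the suffix string come from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Numbers taken from the commit message; names are illustrative only.
constexpr std::size_t filesystem_component_limit = 255;
constexpr std::size_t uuid_length = 32;
constexpr std::size_t separator_length = 1; // the '-' before the UUID
constexpr std::string_view cdc_log_suffix = "_scylla_cdc_log";
constexpr std::size_t reserved_for_future_use = 15;

constexpr std::size_t max_name_length = filesystem_component_limit
    - uuid_length - separator_length - cdc_log_suffix.size()
    - reserved_for_future_use;

static_assert(cdc_log_suffix.size() == 15);
static_assert(max_name_length == 192);

// Hypothetical length check against the derived limit.
constexpr bool is_valid_name_length(std::string_view name) {
    return name.size() <= max_name_length;
}
```

Deriving the limit from its parts keeps the suffix length tied to the actual suffix string, so a change to either would fail the `static_assert` instead of silently allowing over-long directory names.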

This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.

(cherry picked from commit 4577c66a04)
2025-06-22 17:38:30 +00:00
Karol Nowacki
87f31f79a3 replica: Remove unused keyspace::init_storage()
This function was declared but had no implementation or callers. It is being removed as minor code cleanup.

(cherry picked from commit a41c12cd85)
2025-06-22 17:38:29 +00:00
Jenkins Promoter
942b16ffe5 Update ScyllaDB version to: 2025.2.0-rc6 2025-06-22 15:01:54 +03:00
Pavel Emelyanov
66fe11a126 Update seastar submodule (no nested stall backtraces)
* seastar 9f0034a0...450e36d5 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24541
2025-06-19 10:08:40 +03:00
Piotr Dulikowski
9536949911 Merge '[Backport 2025.2] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24488

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
scylla-2025.2.0-rc5 scylla-2025.2.0-rc5-candidate-20250618080131
2025-06-18 10:25:32 +02:00
Anna Stuchlik
01d3b504d1 doc: add support for z3 GCP
This commit adds support for z3-highmem-highlssd instance types to
Cloud Instance Recommendations for GCP.

Fixes https://github.com/scylladb/scylladb/issues/24511

Closes scylladb/scylladb#24533

(cherry picked from commit 648d8caf27)

Closes scylladb/scylladb#24545
2025-06-17 23:40:47 +03:00
Michael Litvak
305f827888 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-17 13:59:10 +00:00
Michael Litvak
d094bc6fc9 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-17 13:59:10 +00:00
Michael Litvak
c11a2e2aaf tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
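
The allocation rule described above can be summarized as a small predicate. This is a sketch under assumptions: `tablet_stage`, `replica_role`, and `needs_storage_group` are hypothetical names, the `streaming` enumerator merely stands in for any earlier stage, and the real code makes this decision inside the table's tablet handling rather than in a free function.

```cpp
#include <cassert>

// Illustrative subset of stages; names mirror the message, not the code.
enum class tablet_stage {
    streaming, // stands in for any stage before cleanup
    cleanup,
    end_migration,
    cleanup_target,
    revert_migration,
};

enum class replica_role { leaving, pending };

// Whether a replica should hold a storage group for the tablet when the
// table is initialized in the given stage.
bool needs_storage_group(replica_role role, tablet_stage stage) {
    if (role == replica_role::leaving) {
        // Still allocated during cleanup, gone by end_migration.
        return stage != tablet_stage::end_migration;
    }
    // Pending replica of an aborted migration: gone by revert_migration.
    return stage != tablet_stage::revert_migration;
}
```

Under this rule a leaving replica restarted in `cleanup` allocates the group and drops it on the transition to `end_migration`, while one restarted in `end_migration` never allocates it.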

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-17 13:59:10 +00:00
Botond Dénes
a63b22eec6 Merge '[Backport 2025.2] tablets: fix missing data after tablet merge ' from Scylladb[bot]
Consider the following scenario:

1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:

1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]

With cache:

1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, the sstable set is aligned with tablet ranges and no premature EOS is signalled. Without the refresh, the premature EOS prevents the fast forward from happening, so not all data is captured in the read.

This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.

Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.

Fixes: https://github.com/scylladb/scylladb/issues/23313

This change needs to be backported to all supported versions which implement tablet merge.

- (cherry picked from commit d0329ca370)

- (cherry picked from commit 1f9f724441)

- (cherry picked from commit 53df911145)

Parent PR: #24287

Closes scylladb/scylladb#24339

* github.com:scylladb/scylladb:
  replica: Fix range reads spanning sibling tablets
  test: add reproducer and test for mutation source refresh after merge
  tablets: trigger mutation source refresh on tablet count change
2025-06-17 08:35:14 +03:00
Jenkins Promoter
0adf905112 Update ScyllaDB version to: 2025.2.0-rc5 2025-06-16 16:21:22 +03:00
Pavel Emelyanov
c2a9f2d9c6 Update seastar submodule
* seastar d7ff58f2...9f0034a0 (1):
  > http_client: Add ECONNRESET to retryable errors

And switch to 2025.2 branch from scylla-seastar for backports

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24446
2025-06-15 17:33:16 +03:00
Raphael S. Carvalho
79958472bc replica: Fix range reads spanning sibling tablets
We don't guarantee that coordinators will only emit range reads that
span only one tablet.

Consider this scenario:

1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.

We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, neither of the sibling tablets is going anywhere since it
runs with the state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
2025-06-15 09:14:38 -03:00
Ferenc Szili
ba192c1a29 test: add reproducer and test for mutation source refresh after merge
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.

(cherry picked from commit 1f9f724441)
2025-06-15 09:14:37 -03:00
Jenkins Promoter
89f5374435 Update pgo profiles - aarch64 2025-06-15 04:46:00 +03:00
Jenkins Promoter
184e0716b3 Update pgo profiles - x86_64 2025-06-15 04:08:36 +03:00
Anna Stuchlik
baa2592299 doc: remove the limitation for disabling CDC
This commit removes the instruction to stop all writes before disabling CDC with ALTER.

Fixes https://github.com/scylladb/scylla-docs/issues/4020

Closes scylladb/scylladb#24406

(cherry picked from commit b0ced64c88)

Closes scylladb/scylladb#24476
scylla-2025.2.0-rc4 scylla-2025.2.0-rc4-candidate-20250613105409
2025-06-13 14:07:38 +03:00
Robert Bindar
a926cba476 Add support for nodetool refresh --skip-reshape
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24409
Fixes: #24365

(cherry picked from commit ca1a9c8d01)

Closes scylladb/scylladb#24472
2025-06-13 14:06:19 +03:00