Commit Graph

1202 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
489bdb2f93 repair/repair: defer percentage calculations in debug log with value_of()
Severity: low

Six percentage calculation function calls were evaluated eagerly as
debug log arguments on every repair range completion. Use
seastar::value_of() to defer the computations to log-time, skipping
them when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
9efd94199d repair/row_level: avoid redundant coroutine invocations in debug logs
Severity: high

working_row_hashes().get() was being called multiple times purely for
debug log arguments, rebuilding the entire hash set each time. Cache
the result and wrap debug logs in is_enabled() guards to avoid the
expensive coroutine re-invocation when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Asias He
6cb263bab0 repair: Prevent CPU stall during cross-shard row copy and destruction
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.

This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.

Fixes SCYLLADB-403

Closes scylladb/scylladb#28979
2026-03-17 11:05:15 +02:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.

Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
Gleb Natapov
02fc4ad0a9 treewide: remove schema pull code since we never pull schema any more
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
2026-03-10 10:09:39 +02:00
Botond Dénes
509f2af8db Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop
2026-03-05 14:18:25 +02:00
Asias He
225b10b683 repair: Fix rwlock in compaction_state and lock holder lifecycle
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:15 -03:00
Raphael S. Carvalho
1d8903d9f7 repair: Prevent repair lock holder leakage after table drop
Prevent repair lock holder from being leaked in repair_service when table
is dropped midway.
The leakage might result in use-after-free later, since the repair lock
itself will be gone after table drop.
The RPC verb that removes the lock on success path will not be called
by coordinator after table was dropped.

Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:10 -03:00
Marcin Maliszkiewicz
a83ee6cf66 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature
2026-03-02 12:09:10 +01:00
Taras Veretilnyk
5bbc44ed12 sstables: replace rewrite_statistics with new rewrite component mechanism
This commits migrates all callers that used rewrite_statistics to new
rewrite component mechanism.
2026-02-26 22:38:55 +01:00
Gleb Natapov
6173ea476b node_ops: remove topology over node ops code
The code is no longer called.
2026-02-25 10:08:32 +02:00
Botond Dénes
0549b61d55 repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
Provides visibility into whether batchlog replay was successful or not.
2026-02-20 07:03:46 +02:00
Aleksandra Martyniuk
3fe596d556 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
2026-01-20 10:06:34 +01:00
Asias He
7ba7b25bdd repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99
2026-01-09 16:11:39 +08:00
Asias He
4f77dd058d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679
2026-01-08 21:55:18 +02:00
Asias He
0aabf51380 repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was obseved:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.

To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.

A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.

Fixes #27666

Closes scylladb/scylladb#27870
2026-01-08 21:55:18 +02:00
Asias He
3abda7d15e topology_coordinator: Ensure repair_update_compaction_ctrl is executed
Consider this:

- n1 is a coordinator and schedules tablet repair
- n1 detects tablet repair failed, so it schedules tablet transition to end_repair state
- n1 loses leadership and n2 becomes the new topology coordinator
- n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000
- when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step

To fix, we use the global_tablet_id to index the lock instead of the
session id.

In addition, we retry the repair_update_compaction_ctrl verb in case of
error to ensure the verb is eventually executed. The verb handler is
also updated to check if it is still in end_repair stage.

Fixes #26346

Closes scylladb/scylladb#27740
2025-12-31 13:17:18 +01:00
Botond Dénes
bfdd4f7776 Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho
Split prepare can run concurrently with repair.

Consider this:

1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
        5.1) that sstable is not marked as repaired yet
        5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable

If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.

To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.

This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.

The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.

Fixes #26041.
Fixes #27414.

Should be backported to 2025.4 that contains incremental repair

Closes scylladb/scylladb#26528

* github.com:scylladb/scylladb:
  test: Add reproducer for split vs intra-node migration race
  test: Verify split failure on behalf of repair during critical disk utilization
  test: boost: Add failure_when_adding_new_sstable_test
  test: Add reproducer for split vs incremental repair race condition
  compaction: Fail split of new sstable if manager is disabled
  replica: Don't split in do_add_sstable_and_update_cache()
  streaming: Leave sstables unsealed until attached to the table
  replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
  replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
  replica: Wire add_new_sstable_and_update_cache() into streaming consumer
  replica: Document old add_sstable_and_update_cache() variants
  replica: Introduce add_new_sstables_and_update_cache()
  replica: Introduce add_new_sstable_and_update_cache()
  replica: Account for sstables being added before ACKing split
  replica: Remove repair read lock from maybe_split_new_sstable()
  compaction: Preserve state of input sstable in maybe_split_new_sstable()
  Rename maybe_split_sstable() to maybe_split_new_sstable()
  sstables: Allow storage::snapshot() to leave destination sstable unsealed
  sstables: Add option to leave sstable unsealed in the stream sink
  test: Verify unsealed sstable can be compacted
  sstables: Allow unsealed sstable to be loaded
  sstables: Restore sstable_writer_config::leave_unsealed
2025-12-23 07:28:56 +02:00
Calle Wilund
5f8f724d78 repair: Don't use off-strategy as repair destination with tablet tables
Fixes #17384

Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.

v2:
Bypass off-strat for whatever reason iff dest is tablets.

Closes scylladb/scylladb#27500
2025-12-16 06:54:07 +02:00
Raphael S. Carvalho
77a4f95eb8 test: Add reproducer for split vs incremental repair race condition
Refs #26041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:16 -03:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Asias He
e97a504775 repair: Allow min max range to be updated for repair history
It is observed that:

repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])

This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.

To fix, we can relax the check of (start, end] for the min max range.

Fixes #27220

Closes scylladb/scylladb#27357
2025-12-05 10:41:25 +02:00
Aleksandra Martyniuk
e3e81a9a7a repair: throw if flush failed in get_flush_time
Currently, _flush_time was stored as a std::optional<gc_clock::time_point>
and std::nullopt indicates that the flush was needed but failed. It's confusing
for the caller and does not work as expected since the _flush_time is initialized
with value (not optional).

Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.

This was suppose to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.

Refs: https://github.com/scylladb/scylladb/issues/24415.

Closes scylladb/scylladb#26794
2025-12-04 11:45:53 +02:00
Asias He
da5cc13e97 repair: Fix deadlock when topology coordinator steps down in the middle
Consider this:

1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.

To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.

Fixes #26346

Closes scylladb/scylladb#27163
2025-11-28 15:14:39 +01:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Asias He
dbeca7c14d repair: Add metric for time spent on tablet repair
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.

Fixes #26505

Closes scylladb/scylladb#26502
2025-11-06 10:00:20 +03:00
Aleksandra Martyniuk
d436233209 repair: fail tablet repair if any batch wasn't sent successfully
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.

If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
2025-10-23 10:39:42 +02:00
Aleksandra Martyniuk
7f20b66eff db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
2025-10-23 10:38:31 +02:00
Asias He
33bc1669c4 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547
2025-10-20 13:21:59 +03:00
Michał Chojnowski
55c4b89b88 sstables: make sstable::estimated_keys_for_range asynchronous
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying this fraction to the number of all keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.

In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.

(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
2025-09-29 13:01:21 +02:00
Asias He
b31e651657 repair: Always reset node ops progress to 100% upon completion
Always set the node ops progress to 100% when the operation finishes,
regardless of success or failure. This ensures the progress never
remains below 100%, which would otherwise indicates a pending node
operation in case of an error.

Fixes #26193

Closes scylladb/scylladb#26194
2025-09-25 11:05:52 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Gleb Natapov
d3badf7406 storage_service: change node_ops_info::ignore_nodes to host id
It drop useless translation from id to ip during removenode through
topology coordinator.

Closes scylladb/scylladb#25958
2025-09-15 10:18:24 +02:00
Asias He
cb7db47ae1 repair: Add incremental_mode option for tablet repair
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.

Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:

- `regular`: This is the default mode. It performs a standard
  incremental repair, processing only unrepaired sstables and skipping
  those that are already repaired. The repair state (`repaired_at`,
  `sstables_repaired_at`) is updated.

- `full`: This mode forces the repair to process all sstables, including
  those that have been previously repaired. This is useful when a full
  data validation is needed without disabling the incremental repair
  feature. The repair state is updated.

- `disabled`: This mode completely disables the incremental repair logic
  for the current repair operation. It behaves like a classic
  (pre-incremental) repair, and it does not update any incremental
  repair state (`repaired_at` in sstables or `sstables_repaired_at` in
  the system.tablets table).

The implementation includes:

- Adding the `incremental_mode` parameter to the
  `/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.

Fixes #25605

Closes scylladb/scylladb#25693
2025-09-09 06:50:21 +03:00
Radosław Cybulski
c242234552 Revert "build: add precompiled headers to CMakeLists.txt"
This reverts commit 01bb7b629a.

Closes scylladb/scylladb#25735
2025-09-03 09:46:00 +03:00
Avi Kivity
600349e29a Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to make and start task step by step openly.

Return task_manager::task::impl from make_and_start_task. Use it
where possible.

Fixes: https://github.com/scylladb/scylladb/issues/22146.

Optimization; no backport

Closes scylladb/scylladb#25743

* github.com:scylladb/scylladb:
  tasks: return task::impl from make_and_start_task
  compaction: use current_task_type
  repair: add new param to tablet_repair_task_impl
  repair: add new params to shard_repair_task_impl
  repair: pass argument by value
2025-08-31 15:44:37 +03:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.

Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e.  `--space-limited-dirs`.
When not provided, tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before)

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after)

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.

The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.

The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.

The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For detailed description check: `docs/dev/view-building-coordinator.md`

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Aleksandra Martyniuk
7fe1ad1f63 tasks: return task::impl from make_and_start_task
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to make and start task step by step openly.

Return task_manager::task::impl from make_and_start_task. Use it
where possible.

Fixes: https://github.com/scylladb/scylladb/issues/22146.
2025-08-29 17:12:07 +02:00
Łukasz Paszkowski
9809800aa8 repair_service: Add a facility to disable the service
Repair service currently have two functions: stop() and shutdown() that
stop all ongoing repairs and prevent any further repairs from being started.

It is possible to stop the repair_service once. Once stopped, it cannot
be restarted. We would like, however, to enable / disable the repair service
many times.

Similarly to compaction_manager, the repair service provides two new functions:
- drain() - abort all ongoing local repair task and disable the service,
            i.e. no new local task will be scheduled and data received from
            the repair master is rejected. It's, though, still possible to
            schedule a global repair request
- enable() - enable the service

By default, the repair service is enabled immediately once started.

For tablet-based keyspaces, the new facility prevents tablets from being
repaired. Whenever the repair_service is disabled and the request to repair
a tablet arrives, an exception is returned.

Once the exception is thrown, the tablet is moved into the end_repair
state and the operation will be retried later. Hence, disabling the service
does not fail the global tablet repair request.
2025-08-29 14:56:13 +02:00
Aleksandra Martyniuk
f3b43b6384 repair: add new param to tablet_repair_task_impl
Currently, sched_info is set immediately after tablet_repair_task_impl
is created.

Pass this param to constructor instead. It's a preparation for
the following changes.
2025-08-29 14:37:00 +02:00
Aleksandra Martyniuk
57b47e282e repair: add new params to shard_repair_task_impl
Currently, neighbors and small_table_optimization_ranges_reduced_factor
are set immediately after shard_repair_task_impl is created.

Pass these params to constructor instead. It's a preparation for following
changes.
2025-08-29 14:27:00 +02:00
Aleksandra Martyniuk
6a0d8728de repair: pass argument by value
shard_repair_task_impl constructor gets some of its arguments by const
reference. Due to that those arguments are copied when they could be
moved.

Get shard_repair_task_impl constructor arguments by value. Use std::move
where possible.
2025-08-29 14:24:47 +02:00
Radosław Cybulski
01bb7b629a build: add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros

Closes #25182
2025-08-27 21:37:54 +03:00
Michał Jadwiszczak
233f4dcee3 db/view/view_building_worker: register staging sstable to view building coordinator when needed
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.

Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
      to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
      to `view_update_generator` but create view building tasks for
      later

The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
2025-08-27 10:23:03 +02:00
Ernest Zaslavsky
4bee0491ba cmake: Add missing incremental.cc to repair/CMakeLists.txt
Add `incremental.cc` to `repair/CMakeLists.txt` to fix CMake based build

Closes scylladb/scylladb#25601
2025-08-21 09:40:36 +03:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
082bc70a0a replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
It helps to hide the compaction_group_views from repair subsystem.
2025-08-18 11:01:22 +08:00