Commit Graph

80 Commits

Author SHA1 Message Date
Raphael S. Carvalho
74ecedfb5c replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078
2025-11-20 11:44:03 +02:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Taras Veretilnyk
deb8e32e86 sstables: Add integrity option to create_single_key_sstable_reader
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
2025-10-28 19:27:35 +01:00
Benny Halevy
b17a36c071 tablets: read_tablet_mutations: use unfreeze_and_split_gently
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
d21984d0cc tablets: tablet_map_to_mutations: maybe split tablets mutation
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.

Note that this will convert large mutation to long
vectors of mutations.  A followup change is considered
to convert std::vector:s of mutations to chunked_vector
to prevent large allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
aaddff5211 tablets: tablet_map_to_mutations: accept process_func
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.

This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or simililarly
convert them into canonical mutations.

Next patch will split large tablet mutations to prevent stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:38 +03:00
Asias He
4f3d076dab tablets: Demote set sstables_repaired_at log to debug level
This is log is too excessive when tablet count is high. Demote to debug
level.

Fixes #25926

Closes scylladb/scylladb#26175
2025-09-25 11:05:51 +03:00
Asias He
cb7db47ae1 repair: Add incremental_mode option for tablet repair
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.

Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:

- `regular`: This is the default mode. It performs a standard
  incremental repair, processing only unrepaired sstables and skipping
  those that are already repaired. The repair state (`repaired_at`,
  `sstables_repaired_at`) is updated.

- `full`: This mode forces the repair to process all sstables, including
  those that have been previously repaired. This is useful when a full
  data validation is needed without disabling the incremental repair
  feature. The repair state is updated.

- `disabled`: This mode completely disables the incremental repair logic
  for the current repair operation. It behaves like a classic
  (pre-incremental) repair, and it does not update any incremental
  repair state (`repaired_at` in sstables or `sstables_repaired_at` in
  the system.tablets table).

The implementation includes:

- Adding the `incremental_mode` parameter to the
  `/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.

Fixes #25605

Closes scylladb/scylladb#25693
2025-09-09 06:50:21 +03:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Asias He
5377f87e5a tablet: Add sstables_repaired_at to system.tablets table
It is used to store the repaired_at for each tablet.
2025-08-11 10:10:07 +08:00
Benny Halevy
fce6c4b41d tablets: prevent accidental copy of tablets_map
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.

Add clone() and clone_gently() methods to
allow explicit copies.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-22 15:07:26 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Michael Litvak
a5feb80797 tablets: validate no clustering row mutations on co-located tables
When preparing a tablet metadata change, add another validation that no
clustering row mutations are written to the tablet map of a co-located
dependent table.

A co-located table should never have clustering rows in the
`system.tablets` table. It has only the static row with base_table
column set, pointing to the base table.
2025-07-01 13:20:20 +03:00
Michael Litvak
064ac25ff9 tablets: read_tablet_transition_stage: read from base table
When reading tablet information from system.tablets we need to read it
from the base table, if exists.
2025-07-01 13:20:18 +03:00
Michael Litvak
c74cbca7cb tablets: save and read tablet metadata of co-located tables
Update the tablet metadata save and read methods to work with tablet
maps of co-located tables.

The new function colocated_tablet_map_to_mutation is used to generate a
mutation of a co-located table to system.tablets. It creates a static
row with the base_table column set with the base table id. The function
save_tablet_metadata is updated to use this function for co-located
tables.

When reading tablet metadata from the table, we handle the new case of
reading a co-located table. We store the co-located tables relationships
in the tablet_metadata_builder's `colocated_tables` map, and process it
in on_end_of_stream. The reason we defer the processing is that we want
to set all normal tablet maps first, to ensure the base tablet map is
found when we process a co-located table.
2025-07-01 10:29:59 +03:00
Michael Litvak
4777444024 tablets: add base_table column to system.tablets
Add a new column base_table to the system.tablets table.

It can be set to point to another table to indicate that the tablets of
this table are co-located with the tablets of the base table.

When it's set, we don't store other tablet information in system.tablets
and in the in-memory tablet map object for this table, and we need to
refer instead to the base table tablet information. The method
get_tablet_map always returns the base tablet map.
2025-07-01 10:29:59 +03:00
Raphael S. Carvalho
53df911145 replica: Fix range reads spanning sibling tablets
We don't guarantee that coordinators will only emit range reads that
span only one tablet.

Consider this scenario:

1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.

We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-05-27 22:39:40 -03:00
Asias He
5545289bfa repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417
2025-02-14 09:13:11 +01:00
Tomasz Grabiec
c7f78edc78 Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.

Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.

Fixes #17507

New feature. No backport is needed.

Closes scylladb/scylladb#21896

* github.com:scylladb/scylladb:
  repair: Stop using rpc to update repair time for repairs scheduled by scheduler
  repair: Wire repair_time in system.tablets for tombstone gc
  test: Disable flush_cache_time for two tablet repair tests
  test: Introduce guarantee_repair_time_next_second helper
  repair: Return repair time for repair_service::repair_tablet
  service: Add tablet_operation.hh
2025-01-20 18:08:49 +01:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks - finished tasks are not
presented with the task manager API.

A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check spilt and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: retrun status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Asias He
53e6025aa6 repair: Wire repair_time in system.tablets for tombstone gc
The repair_time in system.tablets will be updated when repair runs
successfully. We can now use it to update the repair time for tombstone
gc, i.e, when the system.tablets.repair_time is propagated, call
gc_state.update_repair_time() on the node that is the owner of the
tablet.

Since b3b3e880d3 ("repair: Reduce hints and batchlog flush"), the
repair time that could be used for tombstone gc might be smaller than
when the repair is started, so the actual repair time for tombstone gc
is returned by the repair rpc call from the repair master node.

Fixes #17507
2025-01-17 16:12:05 +08:00
Asias He
cd96fb5a78 repair: Add repair_hosts_filter and repair_dcs_filter
They will be useful for hosts and DCs selection for the repair
scheduler. It is not implemented yet. Adding it earlier, so we do not
need to change the system tabler later.

Closes scylladb/scylladb#21985
2025-01-14 08:46:26 +02:00
Aleksandra Martyniuk
18b829add8 replica: service: add resize_task_info static column to system.tablets
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
1c29726477 replica: do not set tablet_task_info if it isn't valid
Currently, in tablet_map_to_mutation, repair's and migration's
tablet_task_info is always set.

Do not set the tablet_task_info if there is no running operation.

Closes scylladb/scylladb#22005
2024-12-20 16:10:53 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Aleksandra Martyniuk
d0cda8ebef replica: check enabled features in tablet_map_to_mutation
Before adding a value to a new column in tablet_map_to_mutation
check if the column is supported by the whole cluster.

Closes scylladb/scylladb#21941
2024-12-17 07:02:11 +02:00
Aleksandra Martyniuk
3f9c76c52d replica: add repair related fields to tablet_map_to_mutation 2024-12-12 11:40:40 +01:00
Aleksandra Martyniuk
9fad3a621a replica: service: add migration_task_info column to system.tablets
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
2024-12-11 12:07:36 +01:00
Kefu Chai
f436edfa22 mutation: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-29 14:01:44 +08:00
Avi Kivity
29497f8c5d Merge 'Automatically compute schema version of system tables' from Tomasz Grabiec
Schema of system tables is defined statically and table_schema_version needs to be explicitly set in code like this:

```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```

Whenever schema is changed, the schema version needs to change, otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.

It's not obvious that one needs to do that and developers often forget to do that. There were several instances of mistakes of omission, some caught during review, some not, e.g.: 31ea74b96e.

This patch changes definitions to call the new `schema_builder::with_hash_version()`, which will make the schema builder compute version from schema definition so that changes of the schema will automatically change the version. This way we no longer rely on the developer to remember to bump the version offset.

All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.

Closes scylladb/scylladb#21602

* github.com:scylladb/scylladb:
  system_tables: Compute schema version automatically
  schema_builder: Introduce with_hash_version()
  schema: Store raw_view_info in schema::raw_schema
  schema: Remove dead comment
  hashing: Add hasher for unordered_map
  hashing: Add hasher for unique_ptr
  hashing: Add hasher for double

[avi: add missing include <memory> to hashing.hh]
2024-11-24 18:44:32 +02:00
Asias He
b71a563030 repair: Add core tablet repair scheduler support
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.

The current repair scheduler implementation does the following:

- A tablet is picked to be repaired when the time since last repair is
  bigger than a threshold (auto repair mode) or it is requested by user
  (manual repair mode)

- The tablet repair can be scheduled along with tablet migration and
  rebuild. It runs in the tablet_migration track.

- Repair jobs are scheduled in a smart way so that at any point in time,
  there are no more than configured jobs per shard, which is similar to
  scylla manager's control.

In this patch, both the manual repair and the auto repair are not
enabled yet.
2024-11-20 09:42:41 +08:00
Tomasz Grabiec
8738d9bfa0 system_tables: Compute schema version automatically
This depends on the previous change to the schema_builder
which makes version computation depend on definition only
instead of being new time uuid.

This way we avoid the possibility for a common mistake
when schema of a system table is extended but we forget
to bump up its version passed to .with_version().
2024-11-15 19:16:41 +01:00
Tomasz Grabiec
3e438d23e1 Merge 'Check system.tablets update before putting it into the table' from Pavel Emelyanov
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

fixes #20043

Closes scylladb/scylladb#21020

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
2024-10-15 00:38:59 +02:00
Pavel Emelyanov
1863ccd900 tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-10 12:39:58 +03:00
Pavel Emelyanov
3dcf3d65d7 replica: Use substract_sets() helper
The process_one_row() evaluates pending_replica by subtracting replicas
from new_replicas. There's a convenience helper for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21019
2024-10-09 20:02:16 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Botond Dénes
3ee0d7f2d1 Merge 'tools: Enhance scylla sstable shard-of to support tablets' from Kefu Chai
before this change, `scylla sstable shard-of` didn't support tablets,
because:

- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges

in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted

This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.

Fixes scylladb/scylladb#16488

---

no need to backport, it's an improvement, not a critical fix.

Closes scylladb/scylladb#20002

* github.com:scylladb/scylladb:
  tools: enhance `scylla sstable shard-of` to support tablets
  replica/tablets: extract tablet_replica_set_from_cell()
  tools: extract get_table_directory() out
  tools: extract read_mutation out
  build: split the list of source file across multiple line
  tools/scylla-sstable: print warning when running shard-of with tablets
2024-08-20 13:51:12 +03:00
Lakshmi Narayanan Sreethar
44583eed9e replica: fix copy constructor of tablet_sstable_set
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.

Fixes #19519

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-08-17 23:37:58 +05:30
Kefu Chai
4291033b14 replica/tablets: extract tablet_replica_set_from_cell()
so it can be reused to implement a low-level tool which reads tablets
data from sstables

Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Botond Dénes
bb1e733fe0 replica/tablets: add update_tablet_metadata()
Allows updateng tablet metadata in-place, according to the provided
hint, reading and updating only the parts that actually changed.
2024-08-11 09:52:37 -04:00
Botond Dénes
66292b4baa replica/tablets: fix indentation
Left broken from the previous patch.
2024-08-11 09:52:37 -04:00
Botond Dénes
aa378c458e replica/tablets: extract tablet_metadata builder logic
So it can be reused in a new method.
Indentation is left broken deliberately, to make the patch easier to
read.
2024-08-11 09:52:37 -04:00
Botond Dénes
f5976aa87b replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
  tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
  on a single mutation, allowing for incremental hint extraction
2024-08-11 09:52:37 -04:00
Botond Dénes
0254cfc7d3 locator/tablets: make tablet_metadata cheap to copy
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
2024-08-11 09:52:37 -04:00
Avi Kivity
fdc1449392 treewide: rename flat_mutation_reader_v2 to mutation_reader
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:

  e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
  08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"

as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.

The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit

  026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"

In turn, flat_mutation_reader was introduced in 2017 in commit

  748205ca75 "Introduce flat_mutation_reader"

To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.

Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.

Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.

Some notes about the transition:

 - files were also renamed. In one case (flat_mutation_reader_test.cc), the
   rename target already existed, so we rename to
    mutation_reader_another_test.cc.

 - a namespace 'mutation_reader' with two definitions existed (in
   mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
   class. As a result, a few #includes had to be adjusted.

Closes scylladb/scylladb#19356
2024-06-21 07:12:06 +03:00
Gleb Natapov
fb764720d3 topology coordinator: add more trace level logging for debugging
Add more logging that provide more visibility into what happens during
topology loading.

Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>
2024-06-18 10:34:03 +02:00
Raphael S. Carvalho
0b2ec3063c sstables: Fix incremental_reader_selector (for range reads) with tablets
incremental_reader_selector is the mechanism for incremental comsumption
of disjoint sstables on range reads.

tablet_sstable_set was implemented, such that selector is efficient with
tablets.

The problem is selector is vnode addicted and will only consider a given
set exhausted when maximum token is reached.

With tablets, that means a range read on first tablet of a given shard
will also consume other tablets living in the same shard. That results
in combined reader having to work with empty sstable readers of tablets
that don't intersect with the range of the read. It won't cause extra
I/O because the underlying sstables don't intersect with the range of
the read. It's only unnecessary CPU work, as it involves creating
readers (= allocation), feeding them into combined reader, which will
in turn invoke the sstable readers only to realize they don't have any
data for that range.

With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables
per tablet, there will be this amount of readers (empty or not):
  (100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billions.

~5000 times more readers, it can be quite significant additional cpu
work, even though I/O dominates the most in scans. It's an inefficiency
that we rather get rid of.

The behavior can be observed from logs (there's 1 sstable for each of
4 tablets, but note how readers are created for every single one of
them when reading only 1 tablet range):
```
table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}]
    incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1}
    sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}]
```

Fix is about making sure the tablet set won't select past the
supplied range of the read.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18556
2024-05-14 07:43:22 +03:00