Commit Graph

585 Commits

Author SHA1 Message Date
Piotr Dulikowski
4581c72430 Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.

We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.

Fixes https://github.com/scylladb/scylladb/issues/26258

backports: not needed (a new feature)

Closes scylladb/scylladb#26284

* github.com:scylladb/scylladb:
  cql_test_env.cc: log exception when callback throws
  lwt: prohibit for tablet-based views and cdc logs
  tablets: disallow chains of colocated tables
  database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
2025-09-30 07:15:16 +02:00
Petr Gusev
b01f56a6d3 database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
In upcoming commits we’ll add a test to ensure that a table cannot be
colocated with another table that is itself already colocated.
This must also hold in the case where both colocated tables are
created simultaneously in a single migration_manager announcement.
We use Paxos tables as an example of colocated tables in this test.
To support this, get_base_table_for_tablet_colocation needs to look
for the base table among the batch of tables being created.
2025-09-26 16:46:32 +02:00
Botond Dénes
fb16c0a6d4 root,replica: move multishard_mutation_query to replica/
It belongs there, it is a completely replica-side thing. Also take the
opportunity to rename it to multishard_query.{hh,cc}, it is not just
mutation anymore (data query is also implemented).
2025-09-26 11:15:38 +03:00
Botond Dénes
1999d8e3d3 compaction: remove using namespace {compaction,sstables}
Some files in compaction/ have using namespace {compaction,sstables}
clauses, some even in headers. This is considered bad practice and
muddies the namespace use. Remove them.
2025-09-25 15:03:57 +03:00
Wojciech Mitros
d9b8278178 mv: handle mismatched base/view replica count caused by RF change
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492
2025-09-22 12:50:16 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Michael Litvak
650ae30c97 cdc: colocate cdc table with base
When creating a tablet map for a CDC table, make it be co-located with
its base table.

We modify db::get_base_table_for_tablet_colocation to return the base
table id of a CDC table, handling both cases that the base table is a
new table that's created in the same operation, or is an existing table
in the db. This function is used by the tablet allocator to decide
whether to create a co-located tablet map or allocate new tablets.
2025-09-17 14:47:12 +02:00
Radosław Cybulski
436150eb52 treewide: fix spelling errors
Fix spelling errors reported by copilot on github.
Remove single use namespace alias.

Closes scylladb/scylladb#25960
2025-09-12 15:58:19 +03:00
Calle Wilund
bc20861afb system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699
2025-09-03 07:25:34 +03:00
Łukasz Paszkowski
9539e80e54 compaction_manager: Subscribe to out of space controller 2025-08-29 14:56:07 +02:00
Łukasz Paszkowski
3d03b88719 database: Add critical_disk_utilization mode database can be moved to
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.

The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
2025-08-29 13:46:45 +02:00
Dawid Mędrek
837d267cbf main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.
2025-08-21 19:35:33 +02:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
be15972006 compaction: Move compaction_reenabler to compaction_reenabler.hh
So it can be used without bringing the whole
compaction/compaction_manager.hh.
2025-08-18 11:01:22 +08:00
Botond Dénes
bd32d41cad replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
Use the newly introduced expiry_treshold field of max_purgeable, to help
exclude memtables from the overlap check if possible.
2025-08-11 17:20:12 +03:00
Botond Dénes
9d00d7e08d replica/memtable_list: add tombstone_gc_state* member
To be passed down to the memtable.
2025-08-11 07:09:19 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Botond Dénes
1d3a3163a3 replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
Also change to the return type to max_purgeable, instead of raw
timestamp. Prepares for further patching of this code.
2025-08-11 07:09:13 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Benny Halevy
34b223f6f9 replica: database: keyspace: rename {create,update}_effective_replication_map
to *_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
688bd4fd43 locator: effective_replication_map_factory: rename create_effective_replication_map
to create_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
cbad497859 locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 16:03:53 +03:00
Benny Halevy
bd62421c05 keyspace: rename get_vnode_effective_replication_map
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:40:43 +03:00
Benny Halevy
ec85678de1 locator: abstract_replication_strategy: define is_local
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().

Note that is_vnode_based() still applies to local replication
strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Ferenc Szili
268ec72dc9 truncate: change check for write during truncate into a log warning
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.

The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It
also outputs to keyspace/table names, the truncated_at timepoint, the
offending replay positions which caused the check to fail.

Fixes: #25173
Fixes: #25013
2025-08-04 12:24:50 +02:00
Patryk Jędrzejczak
8e43856ca7 Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov
When running compactions are aborted by the aforementioned helper, in logs there appear a line like
"Compaction for ks/cf was stopped due to: user-triggered operation". This message could've been better, since it may indicate several distinct reasons described with the same "user-triggered operation".

With this PR the message will help telling "truncate", "cleanup", "rewrite" and "split" from each other.

Closes scylladb/scylladb#25136

* https://github.com/scylladb/scylladb:
  compaction: Pass "reason" to perform_task_on_all_files()
  compaction: Pass "reason" to run_with_compaction_disabled()
  compaction: Pass "reason" to stop_and_disable_compaction()
2025-07-29 16:06:17 +02:00
Petr Gusev
1b70623908 database: get_base_table_for_tablet_colocation: handle paxos state table
We need to mark paxos state table as colocated with the user table, so
that the corresponding tablets are migrated/repaired together.
2025-07-24 19:48:08 +02:00
Pavel Emelyanov
db46da45d2 compaction: Pass "reason" to stop_and_disable_compaction()
This tells "truncate" operation from other reasons

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-07-22 18:51:16 +03:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Marcin Maliszkiewicz
b103fee5b6 db: replica: simplify seeding ERM during shema change
We know that caller is running on shard 0 so we can avoid some extra boilerplate.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
44490ceb77 db: remove cleanup from add_column_family
Since we abort now on failure during schema commit
there is no need for cleanup as it only manages in-memory
state.

Explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
e3f92328d3 replica: db: make keyspace schema changes atomic
Now all keyspace related schema changes are observable
on given shard as they would be applied atomically.
This is achieved by commit_on_shard() function being
non-preemptive (no futures, no co_awaits).

In the future we'll extend this to the whole schema
and also other subsystems.
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
b18cc8145f db: atomically apply changes to tables and views
In this commit we make use of splitted functions introduced before.
Pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
  functions

Additionally we introduce frozen_schema_diff because converting
schema_ptr to global_schema_ptr triggers schema registration and
with atomic changes we need to place registration only in commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schema across shards (via schema_registry cache).
2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz
19bc6ffcb0 replica: make truncate_table_on_all_shards get whole schema from table_shards
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.

Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
1c5ec877a7 replica: split add_column_family_and_make_directory into steps
This is similar work as for drop_table in previous commit.

add_column_family_and_make_directory() behaves exactly the same
as before but calls to it in schema_applier will be replaced by
calls directly to split steps. Other usages will remain intact as
they don't need atomicity (like creating system tables at startup).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
c2cd02272a replica: db: split drop_table into steps
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.

We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()

prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.

We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.

Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz
71bd452075 replica: make non-preemptive keyspace create/update/delete functions public
As those operations will be managed by schema_applier class. This
will be implemented in following commit.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
dce0e65213 replica: split update keyspace into two phases
- first phase is preemptive (prepare_update_keyspace)
- second phase is non-preemptive (update_keyspace)

This is done so that schema change can be applied atomically.

Aditionally create keyspace code was changed to share common
part with update keyspace flow.

This commit doesn't yet change the behaviour of the code,
as it doesn't guarantee atomicity, it will be done in following
commits.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
734f79e2ad replica: split creating keyspace into two functions
This is done so that in following commits insert_keyspace can be used
to atomically change schema (as it doesn't yield).
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
ec270b0b5e db: rename create_keyspace_from_schema_partition
It only creates keyspace metadata.
2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz
45c5c44c2d db: replica: decouple keyspace schema change notifications to a separate function
In following commits we want to separate updating code from committing
shema change (making it visible). Since notifications should be issued
after change is visible we need to separate them and call after
committing.

In subsequent commits other notification types will be moved too.

We change here order of notification calls with regards to rest
of schema updating code. I.e. before keyspace notifications triggered
before tables were updated, after the change they will trigger once
everything is updated. There is no indication that notification
listeners depend on this behaviour.
2025-07-10 10:40:42 +02:00
Benny Halevy
493a2303da replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for next patch, the will use this shared_token_metadata
to make mutable_token_metadata_ptr:s

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-07-06 14:22:20 +03:00
Michael Litvak
7211c0b490 tablets: co-locate view tablets with base when the partition keys match
For a view table that has the same partition key as its base table, the
view's tablets are co-located with the base tablets

Fixes scylladb/scylladb#17043
2025-07-01 13:20:19 +03:00
Michael Litvak
018b61f658 tablets: allocator: create co-located tables in a single operation
Co-located base and child tables may be created together in a single
operation.  The tablet allocator in this case needs to handle them
together and not each table independently, because we need to have the
base schema and tablet map when creating the child tablet map.

We do this by registering the tablet allocator to the migration
notification on_before_create_column_families that announces multiple
new tables, and there we allocate tablets for all the new base tables,
and for the new child tables we create their maps from the base tables,
which are either a new table or an existing one.
2025-07-01 13:20:19 +03:00
Michael Litvak
3db8f6fd37 tablets: allocate co-located tablets
When allocating tablets for a new table, add the option to create a
co-located tablet map with an existing base table.

The co-located tablet map is created with the base_table value set.
2025-07-01 13:20:18 +03:00
Botond Dénes
ebd9420687 sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
2025-06-25 08:41:26 +03:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Tomasz Grabiec
0b516da95b Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single  `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`.

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649

Closes scylladb/scylladb#20853

* github.com:scylladb/scylladb:
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-06-10 13:45:32 +02:00