`system.view_building_tasks` is a single partition table, so it makes
more sense to use a mutation builder and generate 1 mutation per group0
command instead of generating multiple mutations.
`system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row
as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.
Two-part fix:
**1. Range tombstones instead of row tombstones (commits 2–3)**
Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.
**2. Bounded scan with `min_task_id` (commits 4–6)**
Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.
- Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
- On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
- On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
- Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.
The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
The issue is not critical, so the fix shouldn't be backported.
Fixes SCYLLADB-657
Closesscylladb/scylladb#28929
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
docs: document tombstone avoidance in view_building_tasks
view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
view_building: introduce `task_uuid_generator`
view_building: store `min_alive_uuid` in view building state
view_building: set min_task_id when GC-ing finished tasks
view_building: add min_task_id support to view_building_task_mutation_builder
view_building: add min_task_id static column and bounded scan to system_keyspace
view_building: use range tombstone when GC-ing finished tasks
view_building: add range tombstone support to view_building_task_mutation_builder
view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
Before this patch, if wait_for_schema_agreement() times out, it threw
a generic std::runtime_error, making it inconvenient for callers to
catch this error only. So in this patch we create and use a new exception
type, schema_agreement_timeout, based on seastar::timed_out_error.
Although wait_for_schema_agreement() was added in commit
a429018a8a was a utility function used in
a dozen places, it has become less interesting after we introduced schema
changes over Raft, and over the years most of the callers to this function
were removed, except one in view.cc which uses an infinite timeout, so
doesn't care about the timeout exception type.
In the next patch we want to add a new caller which *does* care about
the time exception type - hence this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with time uuid
greater than the value saved in `min_alive_uuid`.
This patch introduces the `task_uuid_generator` which ensures that
when we are generating multiple tasks in one group0 command, each task
will have an unique time uuid and each time uuid will be greater than
`min_alive_uuid`.
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.
No need to backport since we remove functionality here.
Closesscylladb/scylladb#28841
* github.com:scylladb/scylladb:
service level: remove version 1 service level code
features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
migration_manager: remove unused forward definitions
test: remove unused code
auth: drop auth_migration_listener since it does nothing now
schema: drop schema_registry_entry::maybe_sync() function
schema: drop make_table_deleting_mutations since it should not be needed with raft
schema: remove calculate_schema_digest function
schema: drop recalculate_schema_version function and its uses
migration_manager: drop check for group0_schema_versioning feature
cdc: drop usage of cdc_local table and v1 generation definition
storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
storage_service: remove unused functions
group0: drop with_raft() function from group0_guard since it always returns true now
gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
gossiper: drop tokens from loaded_endpoint_state
gossiper: remove unused functions
storage_service: do not pass loaded_peer_features to join_topology()
storage_service: remove unused fields from replacement_info
gossiper: drop is_safe_for_restart() function and its use
storage_service: remove unused variables from join_topology
gossiper: remove the code that was only used in gossiper topology
storage_service: drop the check for raft mode from recovery code
cdc: remove legacy code
test: remove unused injection points
auth: remove legacy auth mode and upgrade code
treewide: remove schema pull code since we never pull schema any more
raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
group0: hoist the checks for an illegal upgrade into main.cc
api: drop get_topology_upgrade_state and always report upgrade status as done
service_level_controller: drop service level upgrade code
test: drop run_with_raft_recovery parameter to cql_test_env
group0: get rid of group0_upgrade_state
storage_service: drop topology_change_kind as it is no longer needed
storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
service_storage: remove unused functions
storage_service: remove non raft rebuild code
storage_service: set topology change kind only once
group0: drop in_recovery function and its uses
group0: rename use_raft to maintenance_mode and make it sync
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.
The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.
But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".
This patch enhances the error message (and others similar to it)
to prevent the confusion.
Closesscylladb/scylladb#28831
There are several places in the code that need to explicitly switch into
gossiper scheduling group. For that they currently call database to
provide the group, but it's better to get gossiper sched group from
gossiper itself, all the more so all those places have gossiper at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
acquires a read lock on the vector of listeners, and calls the
callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
write lock on the vector of listeners. it waits because the lock is
held.
A: the callback function calls another notification and calls
thread_for_each which tries to acquire the read lock again. but it
waits since there is a waiter.
Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.
Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
Fixesscylladb/scylladb#27364Closesscylladb/scylladb#27637
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.
Once a task was aborted, it cannot get resurrected to a normal state.
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hitting more often, as _ss was called earlier, but it's not released yet.
Closesscylladb/scylladb#26779
* github.com:scylladb/scylladb:
service: attach storage_service to migration_manager using pluggabe
service: migration_manager: corutinize merge_schema_from
service: migration_manager: corutinize reload_schema
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closesscylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatiable cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:
* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`
Presumably, this was just an omission. It's also not a very important
one since the only extension having validation logic is the
`encryption_schema_extension`, but none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.
Add validation in the missing places.
Fixes#26470.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#26487
When creating a schema for a non-CDC table in the schema_applier, find
its CDC schema that we created previously in the same operation, if any,
and create the schema with a pointer to the CDC schema.
We use the fact that for a base table with CDC enabled, its CDC schema
is created or altered together in the same group0 operation.
Similarly, in schema_tables, when creating table schemas from the
schema tables, first create all schemas that don't have CDC enabled,
then create schemas that have CDC enabled by extending them with the
pointer to the CDC schema that we created before.
There are few additional cases where we create schemas that we need to
consider how to handle.
When loading a schema from schema tables in the schema_loader we decide
not to set the CDC schema, because this schema is mostly used for tools
and it's not used for generating CDC mutations.
When transporting a schema by RPC in the migration manager, we don't
transport its CDC schema, and we always set it to null. Because we use
raft we expect this shouldn't have any effect, because the schema is
synchronized through raft and not through the RPC.
Add to the schema object a member that points to the CDC schema object
that is compatible with this schema, if any.
The compatible CDC schema is created and altered with its base schema in
the same group0 operation.
When generating CDC log mutations for some base mutation we want them to
be created using a compatible schema thas has a CDC column corresponding
to each base column. This change will allow us to find the right CDC
schema given a base mutation.
We also update the relevant structures in the schema registry that are
related to learning about schemas and transporting schemas across
shards or nodes.
When transporting a schema as frozen_schema, we need to transport the
frozen cdc schema as well, and set it again when unfreezing and
reconstructing the schema.
When adding a schema to the registry, we need to ensure its CDC schema
is added to the registry as well.
Currently we always set the CDC schema to nullptr and maintain the
previous behavior. We will change it in a later commit. Until then, we
mark all places where CDC schema is passed clearly so we don't forget
it.
Change the schema loader type in the schema_registry to return a
extended_frozen_schema instead of view_schema_and_base_info, and
remove view_schema_and_base_info which is not used anymore.
The casting between them is trivial.
Currently we construct a frozen schema with base info in few places, and
the caller is responsible for constructing the frozen schema and extracting
the base info if it's a view table.
We change it to make it simpler and remove the burden from the caller.
The caller can simply pass the schema_ptr, and the constructor for
extended_frozen_schema will construct the frozen schema and extract
the additional info it needs. This will make it easier to add additional
fields, and reduces code duplication.
We also make temporary castings between extended_frozen_schema and
view_schema_and_base_info for the transition, which are trivial, until
they are combined to a single type.
We need to avoid reloading schema early as it goes via
schema_applier which internally depends on storage_service
and on distribued_loader initializing all keyspaces.
Simply moving migration manager startup later in the code is not
easy as some services depend on it being initialized so we just
enable those feature listeners a bit later.
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with
CDC enabled fails with an error if the table already exists. Instead,
the query should succeed and be a no-op.
This regression was introduced by commit fed1048059. Previously, when
executing the query, we would first check if the table exists in
do_prepare_new_column_families_announcement. If it did, we would throw
an already_exists_exception, which was handled correctly; otherwise, we
would continue and create the CDC table in the
before_create_column_families notification.
The order of operations was changed in fed1048059, causing the
regression. Now, we first create the CDC schema and add it to the schema
list for creation, and then check for each of them if they already
exist. The problem is that when we create the CDC schema in
on_pre_create_column_families, it also checks if the CDC table already
exists. If it does, it throws an invalid_request_exception, which is not
caught and handled as expected.
This patch restores the previous order of operations: we first check if
the tables exist, and only then add the CDC schema in pre_create.
Fixesscylladb/scylladb#26142
Add a new notification on_before_allocate_tablet_map that is called when
creating a tablet map for a new table and passes the tablet map.
This will be useful next for CDC for example. when creating tablets for
a new table we want to create CDC streams for each tablet in the same
operation, and we need to have the tablet map with the tablet count and
tokens for each tablet, because the CDC streams are based on that.
We need to change slightly the tablet allocation code for this to work
with colocated tables, because previously when we created the tablet map
of a colocated table we didn't have a reference to the base tablet map,
but now we do need it so we can pass it to the notification.
When creating a new table with CDC enabled, we create also a CDC log
table by adding the CDC table's mutations in the same operation.
Previously, it works by the CDC log service subscribing to
on_before_create_column_family and adding the CDC table's mutations
there when being notified about a new created table.
The problem is that when we create the tables we also create their
tablet maps in the tablet allocator, and we want to created the two
tables as co-located tables: we allocate a tablet map for the base
table, and the CDC table is co-located with the base table.
This doesn't work well with the previous approach because the
notification that creates the CDC table is the same notification that
the tablet allocator creates the base tablet map, so the two operations
are independent, but really we want the tablet allocator to work on both
tables together, so that we have the base table's schema and tablet map
when we create the CDC table's co-located tablet map.
In order to achieve this, we want to create and add the CDC table's
schema, and only after that notify using before_create_column_families
with a vector that contains both the base table and CDC table. The
tablet allocator will then have all the information it needs to create
the co-located tablet map.
We move the creation of the CDC log table - instead of adding the
table's mutations in on_before_create_column_family, we create the table
schema and add it to the new tables vector in
on_pre_create_column_families, which is called by the migration manager
in do_prepare_new_column_families_announcement. The migration manager
will then create and add all mutations for creating the tables, and
notify about the tables being created together.
When a keyspace is dropped, remove all unfinished building tasks for all
views and remove their entries from `system.view_built_status_v2`
and `system.built_views`.
When a view is dropped, remove all unfinished building tasks, remove
entries from `system.view_built_status_v2` and `system.built_views`.
If the view is currently being built, removing its tasks means they
are also aborted.
Finished tasks are already removed from the table.
Create view building tasks in the same batch as new view mutations.
The tasks are created only if `VIEW_BUILDING_COORDINATOR` feature is on
and the view is in tablet keyspace.
Commit ddc3b6dcf5 added a check of group0 state in
get_schema_for_write(), but group0 client can only be used on shard 0,
and get_schema_for_write() can be called on any shard, so we cannot use
_group0_client there directly. Move assert where we use another group0
function already where it is guarantied to run on shard 0.
Closesscylladb/scylladb#25204
Pass a timeout parameter through to start_operation()
and add_entry(), respectively.
This is a preparatory change for the next commit, which
will use the timeout to properly handle timeouts during
lazy creation of Paxos state tables.
If schema pull are disabled group0 is used to bring up to date schema
by calling start_group0_operation() which executes raft read barrier
internally, but if the group0 is still in use_pre_raft_procedures
start_group0_operation() silently does nothing. Later the code that
assumes that schema is already up-to-date will fail and print warnings
into the log. But since getting queries in the state when a node is in
raft enabled mode but group0 is still not configured is illegal it is
better to make those errors more visible buy asserting them during
testing.
Closesscylladb/scylladb#25112
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes#24807Fixes#24925Closesscylladb/scylladb#24929
* github.com:scylladb/scylladb:
streaming: Avoid deadlock by running view checks in a separate scheduling group
service: migration_manager: Run group0 barrier in gossip scheduling group
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531Closesscylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
Fixes two issues.
One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.
The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.
Fixes#24925
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).