Commit Graph

4972 Commits

Author SHA1 Message Date
Michał Chojnowski
7d26d3c7cb db/config: add an option that disables dict-aware sstable compressors in DDL statements
For reasons, we want to be able to disallow dictionary-aware compressors
in chosen deployments.

This patch adds a knob for that. When the knob is disabled,
dictionary-aware compressors will be rejected in the validation
stage of CREATE and ALTER statements.

Closes scylladb/scylladb#24355
2025-06-09 13:30:40 +03:00
Marcin Maliszkiewicz
ddc0656eb5 db: schema_applier: call destroy also when exception occurs
Otherwise objects may be destroyed on wrong shard, and assert
will trigger in ~sharded().
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
547bb1f663 db: replica: simplify seeding ERM during shema change
We know that caller is running on shard 0 so we can avoid some extra boilerplate.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
97cdb72d4d db: remove cleanup from add_column_family
Since we abort now on failure during schema commit
there is no need for cleanup as it only manages in-memory
state.

Explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
d5075c70ef db: abort on exception during schema commit phase
As we have no way to recover from partial commit.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
858db822dc db: make user defined types changes atomic
The same order of creation/destruction is preserved as in the
original code, looking from single shard point of view.

create_types() is called on each shard separately, while in theory
we should be able reuse results similarly as diff_rows(). But we
don't introduce this optimization yet.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
5b2e4140cc replica: db: make keyspace schema changes atomic
Now all keyspace related schema changes are observable
on given shard as they would be applied atomically.
This is achieved by commit_on_shard() function being
non-preemptive (no futures, no co_awaits).

In the future we'll extend this to the whole schema
and also other subsystems.
2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz
556e89bc9d db: atomically apply changes to tables and views
In this commit we make use of splitted functions introduced before.
Pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
  functions

Additionally we introduce frozen_schema_diff because converting
schema_ptr to global_schema_ptr triggers schema registration and
with atomic changes we need to place registration only in commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schema across shards (via schema_registry cache).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
21a5a3c01f service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
92e3d69f79 db: service: add store_service dependency to schema_applier
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
3119a02edd db: don't perform move on tablet_hint reference
This lambda is called several times so there should be no move.
Currently the bug likely doesn't manifest as code does work
only on shard 0.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
141a5643e5 replica: db: split drop_table into steps
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.

We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()

prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.

We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.

Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
2bae38e252 db: don't move map references in merge_tables_and_views()
Since they are const it's not needed and misleading.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
85f19e165a db: introduce commit_on_shard function
This will be the place for all atomic schema switching
operations.

Note that atomicity is observed only from single shard
point of view. All shards may switch at slightly different times
as global locking for this is not feasible.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
b3730282c3 db: access types during schema merge via special storage
Once we create types atomically the code which is before commit
may depend on newly added types, so it has to access both old and
new types. New storage called in_progress_types_storage was added.
2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz
aceb1f9659 db: rename create_keyspace_from_schema_partition
It only creates keyspace metadata.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
f8fe51640a db: decouple functions and aggregates schema change notification from merging code 2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
52069d954f db: store functions and aggregates change batch in schema_applier
To be used in following commit.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
5fff3097a5 db: decouple tables and views schema change notifications from merging code
As post_commit() can't be fully implemented at this stage,
it was moved to interim place to keep things working.
It will be moved back later.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
6f8579e242 db: store tables and views schema diff in schema_applier
It will be used in subsequent commit for moving
notifications code.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
b74c1e9ae4 db: decouple user type schema change notifications from types merging code
Merging types code now returns generic affected_types structure which
is used both for notifications and dropping types. New static
function drop_types() replaces dropping lambda used before.

While I think it's not necessary for dropping nor notifications to
use per shard copies (like it's using before and after this patch)
it could just use string parameters or something similar but
this requires too many changes in other classes so it's out of scope
here.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
3a95edd0d7 service: unify keyspace notification functions arguments
Keyspace metadata is not used, only name is needed so
we can remove those extra find_keyspace() calls.

Moreover there is no need to copy the name.
2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz
d7202586ca db: replica: decouple keyspace schema change notifications to a separate function
In following commits we want to separate updating code from committing
shema change (making it visible). Since notifications should be issued
after change is visible we need to separate them and call after
committing.

In subsequent commits other notification types will be moved too.

We change here order of notification calls with regards to rest
of schema updating code. I.e. before keyspace notifications triggered
before tables were updated, after the change they will trigger once
everything is updated. There is no indication that notification
listeners depend on this behaviour.
2025-05-27 19:59:47 +02:00
Marcin Maliszkiewicz
ddf9f7ae05 db: add class encapsulating schema merging
This commit doesn't yet change how schema merging
works but it prepares the ground for it.

We split merging code into several functions.
Main reasons for it are that:

- We want to generalize and create some interface
which each subsystem would use.

- We need to pull mutation's apply() out
of the code because raft will call it directly,
and it will contain a mix of mutations from more
than one subsystem. This is needed because we have
the need to update multiple subsystems atomically
(e.g. auth and schema during auto-grant when creating
a table).

In this commit do_merge_schema() code is split between
prepare(), update(), commit(), post_commit(). The idea
behind each of these phases is described in the comments.
The last 2 phases are not yet implemented as it requires more
code changes but adding schema_applier enclosing class
will help to create some copied state in the future and
implement commit() and post_commit() phases.
2025-05-27 19:33:02 +02:00
Botond Dénes
485df63fd5 Merge 'Extend compaction_history table with additional compaction statistics' from Łukasz Paszkowski
Currently, the `system.compaction_history` table miss information like the type of compaction (cleanup, major, resharding, etc), the sstable generations involved (in and out), shard's id the compaction was triggered on and statistics on purged tombstones to be collected during compaction.

The series extends the table with the following columns:

-  "compaction_type" (text)
- "shard_id" (int)
- "sstables_in" (list<sstableinfo_type>)
- "sstables_out" (list<sstableinfo_type>)
- "total_tombstone_purge_attempt" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)

with a user defined type `sstableinfo_type` that holds the information about sstable file

- generation (uuid)
- origin (text)
- size (long)

Additional statistics stored in the compaction_history have been incorporated in the API  `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command.

No backport is required. It extends the existing compaction history output.

Fixes https://github.com/scylladb/scylladb/issues/3791

Closes scylladb/scylladb#21288

* github.com:scylladb/scylladb:
  nodetool: Refactor of compactionhistory_operation
  nodetool: Add more stats into compactionhistory output
  api/compaction_manager: Extend compaction_history api
  compaction: Collect tombstone purge stats during compaction
  compacting_reader: Extend to accept tombstone purge statistics
  mutation_compactor: Collect tombstone purge attempts
  compaction_garbage_collector: Extend return type of max_purgeable_fn
  compaction: Extend compaction_result to collect more information
  system_keyspace: Upgrade compaction_history table
  system_keyspace: Create UDT: sstableinfo_type
  system_keyspace: Extract compaction_history struct
  system_keyspace: Squeeze update_compaction_history parameters
  compaction/compaction_manager: update_history accepts compaction_result as rvalue
2025-05-27 14:12:13 +03:00
Avi Kivity
f195c05b0d untyped_result_set: mark get_blob() as returning unfragmented data
Blobs can be large, and unfragmented blobs can easily exceed 128k
(as seen in #23903). Rename get_blob() to get_blob_unfragmented()
to warn users.

Note that most uses are fine as the blobs are really short strings.

Closes scylladb/scylladb#24102
2025-05-26 09:40:34 +02:00
Łukasz Paszkowski
503d4f014c compaction_garbage_collector: Extend return type of max_purgeable_fn
Currently, when a max purgeable timestamp is computed, there is no
information where it comes from and how the value was obtained.
Take compaction, if there are memtables or other uncompacting sstables
possibly shadowing data, the timestamp is decreased to ensure a
tombstone is not purged but the caller does not know what that the
timestamp has its value.

In this patch, we extend the return type of max_purgeable_fn to
contain not only a timestamp but also an information on how it was
computed. This information will be required to collect statistics
on tombstone purge failures due to overlapping memtables/uncompacting
sstables that come later in the series.
2025-05-16 19:59:54 +02:00
Łukasz Paszkowski
0490068982 system_keyspace: Upgrade compaction_history table
Currently, the system.compaction_history table miss precious
information like the type of compaction (cleanup, major, resharding,
etc) or the sstable generations involved (in and out) used countless
times to diagnose issues.

Thus, the commit extend the current definition of the table by adding
the following columns:
+ "compaction_type" (text)
+ "started_at" (int)
+ "shard_id" (int)
+ "sstables_in" (list<sstableinfo_type>)
+ "sstables_out" (list<sstableinfo_type>)
+ "total_tombstone_purge_attempt" (long)
+ "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
+ "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)

Furthermore, the commit introduces a new feature flag in order to
prevent nodes from writing data to new columns when a cluster is
not fully upgraded.
2025-05-14 08:32:05 +02:00
Łukasz Paszkowski
28d0c98dab system_keyspace: Create UDT: sstableinfo_type
The new user defined type holds the following information on sstable:
+ generation uuid;
+ origin text;
+ size long;

and will be used by the system.compaction_history table to keep
track of compacted files and the files being the result of this
compaction.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
dc6f8881b8 system_keyspace: Extract compaction_history struct
Move the compaction_history_entry struct to a seperate file. The intent
of this change is to later re-use it in scylla-nodetool as it currently
defines its own structure that is very similar.
2025-05-14 08:31:40 +02:00
Łukasz Paszkowski
4c93b5292d system_keyspace: Squeeze update_compaction_history parameters
Since the number of statistics inserted into compaction_history
table grows in time, the number of parameters in the method
update_compaction_history grows as well.

So instead, let's re-use the already existing compaction_history_entry
structure to populate data from the compaction_manager to the
system table.
2025-05-14 08:31:40 +02:00
Botond Dénes
674d41e3e6 readers/mutation_source: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
7af0690762 mutation/mutation_compactor: drop v2 from compactor and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
b5170e27d0 replica/table: s/make_reader_v2/make_mutation_reader/ 2025-05-09 07:53:29 -04:00
Botond Dénes
3d2651e07c readers/queue: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
ca7f557e86 readers/multishard: drop v2 from reader and related names 2025-05-09 07:53:29 -04:00
Botond Dénes
4d92bc8b2f readers/evictable: drop v2 from reader and related names 2025-05-09 07:53:28 -04:00
Avi Kivity
04fb2c026d config: decrease default large allocation warning threshold to 128k
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.

We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.

Closes scylladb/scylladb#23975
2025-05-05 12:13:48 +03:00
Pavel Emelyanov
b56d6fbb84 Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

Closes scylladb/scylladb#23806

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-05-05 11:28:38 +03:00
Nadav Har'El
3ce7e250cc alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827
2025-05-05 09:59:08 +03:00
Raphael S. Carvalho
c77f710a0c sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Wojciech Mitros
bf7bba9634 mv: add a test for dropping an index while it's building
Dropping an index is a schema change of its base table and
a schema drop of the index's materialized view. This combination
of schema changes used to cause issues during view building, because
when a view schema was dropped, it wasn't getting updated with the
new version of the base schema, and while the view building was
in progress, we would update the base schema for the base table
mutation reader and try generating updates with a view schema that
wasn't compatible with the base schema, failing on an `on_internal_error`.

In this patch we add a test for this scenario. We create an index,
halt its view building process using an injection, and drop it.
If no errors are thrown, the test succeeds.

The test was failing before https://github.com/scylladb/scylladb/pull/23337
and is passing afterwards.
2025-04-24 01:09:32 +02:00
Wojciech Mitros
d77f11d436 base_info: remove the lw_shared_ptr variant
The base_dependent_view_info is no longer needed to be shared or
modified in the view_info, so we no longer need to keep it as
a shared pointer.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
d7bd86591e view_info: don't re-set base_info after construction
In the previous commits we made sure that the base info is not dependent
on the base schema version, and the info dependent on the base schema
version is calculated when it's needed. In this patch we remove the
unnecessary re-setting of the base_info.

The set_base_info method isn't removed completely, because it also has
a secondary function - zeroing the view_info fields other than base_info.
Because of this, in this patch we rename it accordingly and limit its
use to the updates caused by a base schema change.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ea462efa3d base_info: remove base_info snapshot semantics
The base info in view schemas no longer changes on base schema
updates, so saving the base info with a view schema from a specific
point in time doesn't provide any additional benefits.
In this patch we remove the code using the base_and_view snapshots
as it's no longer useful.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ad55935411 base_info: remove base schema from the base_info
The base info now only contains values which are not reliant on the
base schema version. We remove the the base schema from the base info
to make it immutable regardless of base schema version, at the point
of this patch it's also not needed anywhere - the new base info can
replace the base schema in most places, and in the few (view_updates)
where we need it, we pull the most recent base schema version from
the database.

After this change, the base info no longer changes in a view schema
after creation, so we'll no longer get errors when we try generating
view updates with a base_info that's incompatible with a specific
base schema version.

Fixes #9059
Fixes #21292
Fixes #22410
2025-04-24 01:08:39 +02:00
Wojciech Mitros
05fce91945 schema_registry: store base info instead of base schema for view entries
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.

This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor withat base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema
2025-04-24 01:08:39 +02:00
Wojciech Mitros
6e539c2b4d base_info: make members non-const
In the following patches we'll add the base info instead of the
base schema to various places (schema building, schema registry).
There, we'll sometimes need to update the base_info fields, which
we can't do with const members. There's also a place (global_schema_ptr)
where we won't be able to use the base_info_ptr (a shared pointer to the
base_info), so we can't just use the base_info_ptr everywhere instead.

In this patch we unmark these members as const.
In the following patches we'll remove the methods for changing the
base_info in the view schema, so it will remain effectively const.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
32258d8f9a view_info: move the base info to a separate header
In the following commits the base_depenedent_view_info will be needed
in many more places. To avoid including the whole db/view/view.hh
or forward declaring (where possible) the base info, we move it to
a separate header which can be included anywhere at almost no cost.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
a3d2cd6b5e view_info: move computation of view pk columns not in base pk to view_updates
In preparation of making the base_info immutable, we want to get rid of
any base_dependent_view_info fields that can change when base schema
is updated.
The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk
base column_ids of corresponding base columns and they can change
(decrease) when an earlier column is dropped in the base table.
view_updates is the only location where these values are used and calculating
them is not expensive when comparing to the overall work done while performing
a view update - we iterate over all view primary key columns and look them up
in the base table.
With this in mind, we can just calculate them when creating a view_updates
object, instead of keeping them in the base_info. We do that in this patch.
2025-04-24 01:08:39 +02:00