This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`.
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649Closesscylladb/scylladb#20853
* github.com:scylladb/scylladb:
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.
On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.
Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:
- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
consumed from the queue
Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.
Fixes: https://github.com/scylladb/scylladb/issues/23785Closesscylladb/scylladb#23880
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.
1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.
So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.
Fixes#23771.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24426
Before for views and indexes it was fetching base schema from db (and
couple other properties). This is a problem once we introduce atomic
tables and views deletion (in the following commit).
Because once we delete table it can no longer be fetched from db object,
and truncation is performed after atomically deleting all relevant
tables/views/indexes.
Now the whole relevant schema will be fetched via global_table_ptr
(table_shards) object.
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
There is already implicit logical dependency via migration_notifier
but in the next commits we'll be moving store_service out from it
as we need better control (i.e. return a value from the call).
- remove load_tablet_metadata(), instead we add wake_up_load_balancer flag
to update_tablet_metadata(), it reduces number of public functions and
also serves as a comment (removed comment with very similar meaning)
- reimplement the code to not use mutate_token_metadata(), this way
it's more readable and it's also needed as we'll split
update_tablet_metadata() in following commits so that we can have
subroutine which doesn't yield (for ensuring atomicity)
This is done so that actual dropping can be
an atomic step which could be composed with other
schema operations, and eventually all subsystems modified
via raft so that we could introduce atomic changes which
span across different subsystems.
We split drop_table_on_all_shards() into:
- prepare_tables_metadata_change_on_all_shards()
- prepare_drop_table_on_all_shards()
- drop_table()
- cleanup_drop_table_on_all_shards()
prepare_tables_metadata_change_on_all_shards() is necessary
because when applying multiple schema changes at once (e.g. drop
and add tables) we need to lock only once.
We add legacy_drop_table_on_all_shards() which
behaves exactly like old drop_table_on_all_shards() to be
compatible with code which doesn't need to play with atomicity.
Usages of legacy_drop_table_on_all_shards() in schema_applier
will be replaced with direct calls to split functions in the following
commits - that's the place we will take advantage of drop_table not
yielding (as it returns void now).
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.
Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.
Fixes#24314
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#24342
This PR adjusts existing Boost tests so they respect the invariant
introduced by enabling `rf_rack_valid_keyspaces` configuration option.
We disable it explicitly in more problematic tests. After that, we
enable the option by default in the whole test suite.
Fixesscylladb/scylladb#23958
Backport: backporting to 2025.1 and 2025.2 to be able to test the implementation there too.
Closesscylladb/scylladb#23802
* github.com:scylladb/scylladb:
test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
Consider the following scenario:
1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)
We have two possible outcomes:
With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]
With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read.
This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.
Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.
Fixes: https://github.com/scylladb/scylladb/issues/23313
This change needs to be backported to all supported versions which implement tablet merge.
Closesscylladb/scylladb#24287
* github.com:scylladb/scylladb:
replica: Fix range reads spanning sibling tablets
test: add reproducer and test for mutation source refresh after merge
tablets: trigger mutation source refresh on tablet count change
When map_reduce is called on a collection, one shouldn't expect that it
processes the elements of the collection in any specific order.
Current test of map-reduce over boost outcome assumes that if reduce
function is the string concatenation, then it would concatenate the
given vector of strings in the order they are listed. That requirement
should be relaxed, and the result may have reversed concatentation.
Fixesscylladb/scylladb#24321
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24325
Max purgeable has two possible values for each partition: one for
regular tombstones and one for shadowable ones. Yet currently a single
member is used to cache the max-purgeable value for the partition, so
whichever kind of tombstone is checked first, its max-purgeable will
become sticky and apply to the other kind of tombstones too. E.g. if the
first can_gc() check is for a regular tombstone, its max-purgeable will
apply to shadowable tombstones in the partition too, meaning they might
not be purged, even though they are purgeable, as the shadowable
max-purgeable is expected to be more lenient. The other way around is
worse, as it will result in regular tombstone being incorrectly purged,
permitted by the more lenient shadowable tombstone max-purgeable.
Fix this by caching the two possible values in two separate members.
A reproducer unit test is also added.
Fixes: scylladb/scylladb#23272Closesscylladb/scylladb#24171
Metadata id was introduced in CQLv5 to make metadata of prepared
statement metadata consistent between driver and database.
This commit introduces a protocol extension that allows to use the same
mechanism in CQLv4. As CQLv5 is currently unsupported in ScyllaDb (as well
as in some of the drivers), the motivation is to allow fixing https://github.com/scylladb/scylladb/issues/20860.
This change:
- Implement metadata::calculate_metadata_id()
- Implement SCYLLA_USE_METADATA_ID protocol extension for CQLv4
- Added description of SCYLLA_USE_METADATA_ID in documentation
- Add boost tests to confirm correctness of the function
- Add python tests for table metadata change corner-cases
Fixesscylladb/scylladb#20860
Also see related https://scylladb.atlassian.net/wiki/spaces/RND/pages/42238631/MetadataId+extension+in+CQLv4+Requirement+Document
No backport needed (unless specifically requested by a customer), because there are existing workarounds for the issue
Closesscylladb/scylladb#23292
* github.com:scylladb/scylladb:
test: add tests for prepared statement metadata consistency corner cases
transport: implement SCYLLA_USE_METADATA_ID support
cql3: implement metadata::calculate_metadata_id()
We don't guarantee that coordinators will only emit range reads that
span only one tablet.
Consider this scenario:
1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.
We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Some of the tests in the file verify more subtle parts of the behavior
of tablets and rely on topology layouts or using keyspaces that violate
the invariant the `rf_rack_valid_keyspaces` configuration option is
trying to enforce. Because of that, we explicitly disable the option
to be able to enable it by default in the rest of the test suite in
the following commit.
We make sure that the keyspaces created in the test are always RF-rack-valid.
To achieve that, we change how the test is performed.
Before this commit, we first created a cluster and then ran the actual test
logic multiple times. Each of those test cases created a keyspace with a random
replication factor.
That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify
the property file of a node (see commit: eb5b52f598),
so once we set up the cluster, we cannot adjust its layout to work with another
replication factor.
To solve that issue, we also recreate the cluster in each test case. Now we choose
the replication factor at random, create a cluster distributing nodes across as many
racks as RF, and perform the rest of the logic. We perform it multiple times in
a loop so that the test behaves as before these changes.
We distribute the nodes used in the test across two racks so we can
run the test with `rf_rack_valid_keyspaces` set to true.
We want to avoid cross-rack migrations and keep the test as realistic
as possible. Since host3 is supposed to function as a new node in the
cluster, we change the layout of it: now, host1 has 2 shards and resides
in a separate rack. Most of the remaining test logic is preserved and behaves
as before this commit.
There is a slight difference in the tablet migrations. Before the commit,
we were migrating a tablet between nodes of different shard counts. Now
it's impossible because it would force us to migrate tablets between racks.
However, since the test wants to simply verify that an ongoing migration
doesn't interfere with load balancing and still leads to a perfect balance,
that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3,
so to achieve the goal, one more tablet needs to be migrated, and we test
that.
We assign the nodes created by the test to separate racks. It has no impact
on the test since the keyspace used in the test uses RF=2, so the tablet
replicas will still be the same.
We distribute the nodes used in the test between two racks. Although
that may affect how tablets behave in general, this change will not
have any real impact on the test. The test verifies that load balancing
eventually balances tablets in the cluster, which will still happen.
Because of that, the changes in this commit are safe to apply.
We distribute the nodes used in the test between two racks. Although that
may have an impact on how tablets behave, it's orthogonal to what the test
verifies -- whether the topology coordinator is continuously in the tablet
migration track. Because of that, it's safe to make this change without
influencing the test.
Currently, the `system.compaction_history` table miss information like the type of compaction (cleanup, major, resharding, etc), the sstable generations involved (in and out), shard's id the compaction was triggered on and statistics on purged tombstones to be collected during compaction.
The series extends the table with the following columns:
- "compaction_type" (text)
- "shard_id" (int)
- "sstables_in" (list<sstableinfo_type>)
- "sstables_out" (list<sstableinfo_type>)
- "total_tombstone_purge_attempt" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long)
- "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long)
with a user defined type `sstableinfo_type` that holds the information about sstable file
- generation (uuid)
- origin (text)
- size (long)
Additional statistics stored in the compaction_history have been incorporated in the API `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command.
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/3791Closesscylladb/scylladb#21288
* github.com:scylladb/scylladb:
nodetool: Refactor of compactionhistory_operation
nodetool: Add more stats into compactionhistory output
api/compaction_manager: Extend compaction_history api
compaction: Collect tombstone purge stats during compaction
compacting_reader: Extend to accept tombstone purge statistics
mutation_compactor: Collect tombstone purge attempts
compaction_garbage_collector: Extend return type of max_purgeable_fn
compaction: Extend compaction_result to collect more information
system_keyspace: Upgrade compaction_history table
system_keyspace: Create UDT: sstableinfo_type
system_keyspace: Extract compaction_history struct
system_keyspace: Squeeze update_compaction_history parameters
compaction/compaction_manager: update_history accepts compaction_result as rvalue
It seems that tests in test/boost/combined_tests have to define a test
suite name, otherwise they aren't picked up by test.py.
Fixes#24199Closesscylladb/scylladb#24200
This pull request adds support for creating custom indexes (at a metadata level) as long as a supported custom class is provided (currently only vector search).
The patch contains:
- a change in CREATE INDEX statement that allows for the USING keyword to be present as long as one of the supported classes is used
- support for describing custom indexes in the DESCRIBE statement
- unit tests
Co-authored by: @Balwancia
Closesscylladb/scylladb#23720
* github.com:scylladb/scylladb:
test/cqlpy: add custom index tests
index: support storing metadata for custom indices
Currently, when a max purgeable timestamp is computed, there is no
information where it comes from and how the value was obtained.
Take compaction, if there are memtables or other uncompacting sstables
possibly shadowing data, the timestamp is decreased to ensure a
tombstone is not purged but the caller does not know what that the
timestamp has its value.
In this patch, we extend the return type of max_purgeable_fn to
contain not only a timestamp but also an information on how it was
computed. This information will be required to collect statistics
on tombstone purge failures due to overlapping memtables/uncompacting
sstables that come later in the series.
chunked_vector is a replacement for std::vector that avoids large contiguous
allocations.
In this series, we add some missing modifiers and improve quality-of-life for
chunked_vector users (the static_assert patch).
Those modifiers were generally unused since they have O(n) complexity
and therefore not useful for hot paths, but they are used in some
control plane code on vectors which we'd like to replace with chunked_vectors.
A candidate for such a replacement is token_range_vector (see #3335).
This is a prerequisite for fixing some minor stalls; I don't expect we'll backport
fixes to those stalls.
Closesscylladb/scylladb#24162
* github.com:scylladb/scylladb:
utils: chunked_vector: add swap() method
utils: chunked_vector: add range insert() overloads
utils: chunked_vector: relax static_assert
utils: chunked_vector: implement erase() for single elements and ranges
utils: chunked_vector: implement insert() for single-element inserts
Inserts an iterator range at some position.
Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.
Unit tests are added.
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.
Again we defer optimization for trivially copyable types.
Unit tests are added.
Needed for range_streamer with token_ranges using chunked_vector.
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).
Implement using push_back() and std::rotate().
emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).
The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster that std::rotate(),
but this complex optimization is left for later.
Unit tests are added.
CQLv5 introduced metadata_id, which is a checksum computed from column
names and types, to track schema changes in prepared statements. This
commit introduces calculate_metadata_id to compute such id for given
metadata.
Please note that calculate_metadata_id() produces different hashes
than Cassandra's computeResultMetadataId(). We use SHA256 truncated to
128 bits instead of MD5. There are also two smaller technical
differences: calculate_metadata_id() doesn't add unneeded zeros and it
adds a length of a string when an sstring is being fed to the hasher.
The difference is intentional because MD5 has known vulnerabilities,
moreover we don't want to introduce any dependency between our
metadata_id and Cassandra's.
This change:
- Add cql_metadata_id_type
- Implement metadata::calculate_metadata_id()
- Add boost tests to confirm correctness of the function
Added function returning custom index class name.
Added printing custom index class name when using DESCRIBE.
Changed validation to reflect current support of indices.
Apparently `test_kms_network_error` will succeed at any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it will throw, the test will be marked as passed.
Start catching the exact exception that we expect to be thrown.
Closesscylladb/scylladb#24065
Refs scylladb/scylla-enterprise#5321
Adds two small test cases, for slight variations on KMIP host config
being missing when rebooting a node, and table/sstable resolution
failing due to this.
Mainly to verify that we fail as expected, without crashing.
Closesscylladb/scylladb#23544
Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.
Code cleanup, no backport required.
Closesscylladb/scylladb#24087
* github.com:scylladb/scylladb:
test/boost/mutation_reader_another_test: drop v2 from reader and related names
test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
test/boost/mutation_test: s/consumer_v2/consumer/
test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
readers/mutation_readers: s/generating_reader_v2/generating_reader/
readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
readers/mutation_source: s/make_reader_v2/make_mutation_reader/
readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
mutation/mutation_compactor: drop v2 from compactor and related names
replica/table: s/make_reader_v2/make_mutation_reader/
mutation_writer: s/bucket_writer_v2/bucket_writer/
readers/queue: drop v2 from reader and related names
readers/multishard: drop v2 from reader and related names
readers/evictable: drop v2 from reader and related names
readers/multi_range: remove flat from name