This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
This reverts commit faad0167d7. It causes
a regression in
test_two_tablets_concurrent_repair_and_migration_repair_writer_level
in debug mode (with ~5%-10% probability).
Fixes #27510.
Closes scylladb/scylladb#27560
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related code goes down from 346 lines to 239 lines.
Fixes https://github.com/scylladb/scylladb/issues/26313
Closes scylladb/scylladb#27383
* github.com:scylladb/scylladb:
mv: replace the simple/complex rack-aware pairing with exact rack matching
mv: split out vnode pairing code from get_view_natural_endpoint
mv: unify self-pairing and rack-aware pairing into one bool
mv: remove the workaround for left nodes when sending view updates
Refactor the way we decide whether an sstable belongs to a tablet, fully or partially, to simplify the flow and make it more readable. Also extract the logic to make it testable, and add tests to cover the changes.
The change is purely aesthetic, no need to backport
Closes scylladb/scylladb#27101
* github.com:scylladb/scylladb:
streaming: remove unnecessary lambda creating sstable token range
streaming: simplify get_sstables_for_tablets logic
streaming: switch to range-based for loop
streaming: drop sstable skip microoptimization in tablet loop
streaming: replace reverse iterators with reverse view in sstables scan
streaming: return from get_sstables_for_tablets earlier
streaming: add get_sstables_by_tablet_range tests
test,sstables: add helper to set sstable first and last keys
streaming: refactor get_sstables_for_tablets to make it accessible
streaming: refactor get_sstables_for_tablets to make it testable
streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
The default number of retries in `eventually()` in `test_builder_with_concurrent_drop`
is sometimes not enough to observe changes in system tables on aarch64
builds.
This patch increases the number of retries to 30.
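For illustration, a minimal sketch of the retry pattern `eventually()` implements; the helper name, signature, and defaults here are hypothetical, not Scylla's actual test utility:

```python
import time

def eventually(check, retries=30, delay=0.0):
    """Retry `check` until it stops raising, up to `retries` attempts.

    Sketch only: the real test helper differs in naming and backoff.
    """
    for attempt in range(retries):
        try:
            return check()
        except AssertionError:
            if attempt == retries - 1:
                raise          # out of retries: surface the failure
            time.sleep(delay)

# Example: a condition that only starts passing on the 5th attempt,
# mimicking a slowly-propagating system-table change.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    assert calls["n"] >= 5
    return calls["n"]
```

With too few retries (e.g. `retries=3`) the same `flaky` check would raise, which is the aarch64 failure mode this patch works around by raising the cap to 30.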
Fixes scylladb/scylladb#27370
Closes scylladb/scylladb#27493
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.
In order to support this, a new system table is introduced to record the
user-request-related info, i.e., the start of the request and the end of
the request.
The progress remains accurate even when a tablet split or merge happens in
the middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
repair of an original tablet is considered finished when the finished
ranges cover the original tablet's token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
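The coverage rule described above can be sketched as follows, modeling token ranges as half-open integer intervals, which is a simplification of Scylla's actual token ranges:

```python
def coverage_fraction(original, finished):
    """Fraction of the original token ranges covered by finished ranges.

    `original` are the ranges recorded when the request started;
    `finished` accumulate as each (possibly split/merged) tablet is
    repaired. Ranges are (lo, hi) half-open integer intervals.
    """
    def clip(r, within):
        lo, hi = max(r[0], within[0]), min(r[1], within[1])
        return (lo, hi) if lo < hi else None

    total = sum(hi - lo for lo, hi in original)
    covered = 0
    for orig in original:
        # Clip finished pieces to this original range, then sweep
        # left to right so overlaps aren't double-counted.
        pieces = sorted(p for p in (clip(f, orig) for f in finished) if p)
        end = orig[0]
        for lo, hi in pieces:
            covered += max(0, hi - max(lo, end))
            end = max(end, hi)
    return covered / total if total else 1.0
```

An original tablet split into two finished halves is reported complete only once both halves' ranges together cover the original range.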
Fixes #22564
Fixes #26896
Closes scylladb/scylladb#26924
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
This change adds a new option to the REST API and, correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.
This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains stable.
Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.
However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.
To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only on rare occasions (such as
when converting from a numerical replication factor to a rack list containing a
subset of the available racks in a datacenter).
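A toy sketch of why keying deduplication on the stable identifier rather than the generation survives tablet migration; the field names and manifest shape are illustrative, not Scylla Manager's actual format:

```python
def plan_upload(already_uploaded, sstables, key="sstable_id"):
    """Return the sstables not yet present in previous backups.

    Deduplicating on the stable sstable identifier survives tablet
    migration: the per-node generation changes on migration, but the
    identifier does not, so the same data is uploaded only once.
    """
    seen = {s[key] for s in already_uploaded}
    return [s for s in sstables if s[key] not in seen]

# A migrated sstable: new generation, same identifier.
previous = [{"generation": 7, "sstable_id": "3a12ad96"}]
current = [{"generation": 42, "sstable_id": "3a12ad96"},   # migrated copy
           {"generation": 43, "sstable_id": "beefcafe"}]   # genuinely new
```

Keyed on `generation`, the migrated copy looks new and is uploaded again; keyed on `sstable_id`, only the genuinely new sstable is uploaded.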
Fixes #27181
* New feature, no backport required
Closes scylladb/scylladb#27184
* github.com:scylladb/scylladb:
database: truncate_table_on_all_shards: set use_sstable_identifier to true
nodetool: snapshot: add --use-sstable-identifier option
api: storage_service: take_snapshot: add use_sstable_identifier option
test: database_test: add snapshot_use_sstable_identifier_works
test: database_test: snapshot_works: add validate_manifest
sstable: write_scylla_metadata: add random_sstable_identifier error injection
table: snapshot_on_all_shards: take snapshot_options
sstable: add get_format getter
sstable: snapshot: add use_sstable_identifier option
db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
db: snapshot_ctl: move skip_flush to struct snapshot_options
This pull request adds support for calculating and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
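A minimal sketch of the digest-writer idea, assuming a simple file-like sink; the real crc32_digest_file_writer wraps Seastar's file_writer:

```python
import io
import zlib

class Crc32DigestWriter:
    """Wrap a sink and maintain a running CRC32 of every byte written.

    Each checksummed component is written through a writer like this,
    and the final digest is kept for persisting into metadata.
    """
    def __init__(self, sink):
        self._sink = sink
        self._crc = 0

    def write(self, data: bytes):
        # zlib.crc32 accepts a running value, so the digest is
        # computed incrementally as the component streams out.
        self._crc = zlib.crc32(data, self._crc)
        self._sink.write(data)

    def digest(self) -> int:
        return self._crc & 0xFFFFFFFF

buf = io.BytesIO()
w = Crc32DigestWriter(buf)
w.write(b"hello ")
w.write(b"world")
```

The digest of the streamed writes equals the digest of the whole component, so it can be recorded once the component is fully written.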
Several test cases were introduced to verify the expected behaviour.
Backport is not required, it is a new feature
Fixes #20100
Closes scylladb/scylladb#27287
* github.com:scylladb/scylladb:
sstable_test: add verification testcases of SSTable components digests persistence
sstables: store digest of all sstable components in scylla metadata
sstables: Add TemporaryScylla metadata component type
sstables: Extract file writer closing logic into separate methods
sstables: Add components_digests to scylla metadata components
sstables: Implement CRC32 digest-only writer
Fixes #27367
Fixes #27362
Fixes #27366
Makes the http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts, and use system trust if no truststore is set.
Moves Azure/KMS code to use the shared http URL parser to avoid repeating the same regex everywhere.
Closes scylladb/scylladb#27368
* github.com:scylladb/scylladb:
ear::kms/ear::azure: Use utils::http URL parsing
ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
utils::http: Handle ipv6 numeric host part in URL:s
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. This covers test cases for all checksummed components.
Fixes #27366
A URL with a numeric host part is formatted specially in the case of IPv6,
to avoid confusion with the port part.
The parser should handle this.
I.e.
http://[2001:db8:4006:812::200e]:8080
v2:
* Include scheme agnostic parse + case insensitive scheme matching
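A rough sketch of bracket-aware host parsing with case-insensitive scheme matching; this is a simplification, and the real utils::http parser handles more of the URL grammar:

```python
import re

def parse_url(url):
    """Parse scheme://host[:port], where an IPv6 numeric host is
    bracketed so its colons aren't confused with the port separator.
    Minimal sketch: paths, userinfo, defaults, etc. are omitted.
    """
    m = re.match(r'(?i)^([a-z][a-z0-9+.-]*)://(\[[^\]]+\]|[^:/]+)(?::(\d+))?',
                 url)
    if not m:
        raise ValueError(f"unparsable URL: {url}")
    scheme, host, port = m.group(1).lower(), m.group(2), m.group(3)
    if host.startswith('['):
        host = host[1:-1]   # strip the IPv6 brackets
    return scheme, host, int(port) if port else None
```

The bracketed form is what makes `http://[2001:db8:4006:812::200e]:8080` unambiguous: everything inside `[...]` is the host, and the colon after `]` introduces the port.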
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names, and validate the manifest.json file accordingly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The PRUNE MATERIALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to perform a single-row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE
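The concurrency mechanism can be sketched with a semaphore bounding the number of in-flight single-row base reads; this is a simplified model, with `read_base` and the row shapes standing in for the real replica reads:

```python
import asyncio

async def prune_ghost_rows(view_rows, read_base, concurrency=10):
    """Check view rows against the base table, `concurrency` at a time.

    A semaphore caps how many single-row base reads run concurrently,
    which is the effect of the USING CONCURRENCY clause.
    """
    sem = asyncio.Semaphore(concurrency)
    ghosts = []

    async def check(row):
        async with sem:
            if not await read_base(row):
                ghosts.append(row)   # ghost row: would be deleted

    await asyncio.gather(*(check(r) for r in view_rows))
    return ghosts

async def demo():
    base = {1, 2, 4}
    async def read_base(row):
        await asyncio.sleep(0)       # stands in for a network round trip
        return row in base
    return await prune_ghost_rows([1, 2, 3, 4, 5], read_base, concurrency=2)
```

With concurrency 1 this degenerates to the old sequential behavior; raising it overlaps the per-row round trips, which matches the measured speedup until CPU becomes the bottleneck.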
Fixes https://github.com/scylladb/scylladb/issues/27070
Closes scylladb/scylladb#27097
* github.com:scylladb/scylladb:
mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
cql: add CONCURRENCY to the USING clause
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper structures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions don't hold, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, i.e. when these
conditions are fulfilled.
Fixes https://github.com/scylladb/scylladb/issues/26313
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
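Under these assumptions the pairing reduces to a same-rack lookup, sketched here with replicas modeled as (host, rack) tuples; this is an illustrative model, not the actual get_view_natural_endpoint code:

```python
def paired_view_replica(base_replica, view_replicas):
    """Pick the view replica paired with a base replica.

    With tablets, base and view are replicated on the same racks with
    one replica of each tablet per rack, so the paired view replica is
    simply the one in the same rack as the base replica sending the
    update. Replicas are (host, rack) tuples in this sketch.
    """
    base_rack = base_replica[1]
    matches = [v for v in view_replicas if v[1] == base_rack]
    return matches[0] if matches else None   # None: unpaired (e.g. RF change)
```

No index matching or helper structures are needed: one replica per rack on each side makes the same-rack match unique.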
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.
* seastar 8c3fba7a...b5c76d6b (3):
> testing: There can be only one memory_data_sink
> util: Use std::source_location directly
> Merge 'net: support proxy protocol v2' from Avi Kivity
apps: httpd: add --load-balancing-algorithm
apps: httpd: add /shard endpoint
test: socket_test: add proxy protocol v2 test suite
test: socket_test: test load balancer with proxy protocol
net: posix_connected_socket: specialize for proxied connections
net: posix_server_socket_impl: implement proxy protocol in server sockets
net: posix_server_socket_impl: adjust indentation
net: posix_server_socket_impl: avoid immediately-invoked lambda
net: conntrack: complete handle nested class special member functions
net: posix_server_socket_impl: coroutinize accept()
Closes scylladb/scylladb#27316
Fixes #24346
When reading, we check, for each entry and each chunk, whether advancing there
will hit EOF of the segment. However, iff the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ the segment size (preset size, typically 32MB), we did not check
the position, and instead complained about not being able to read.
This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.
Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).
v2:
* Added unit test
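The fix can be sketched as a position check when advancing within a fixed-size segment; this is a simplified model of the commitlog reader, not its actual code:

```python
def advance(pos, step, segment_size):
    """Advance a read position inside a fixed-size segment.

    Returns (new_pos, at_eof). The key case: when the last entry ends
    exactly at the chunk boundary and the chunk ends exactly at the
    segment size, the new position equals segment_size and must be
    reported as a clean EOF rather than a failed read.
    """
    new_pos = pos + step
    if new_pos > segment_size:
        raise IOError("truncated segment")   # genuinely short file
    return new_pos, new_pos == segment_size
```

Before the fix, the exact-boundary case (new position equal to the segment size) fell through to the "cannot read" path instead of being treated as EOF.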
Closes scylladb/scylladb#27236
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
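The missing check can be sketched like this, with tablet ids and targets as plain strings; the types are illustrative, not the load balancer's actual ones:

```python
def colocation_migrations(candidates, scheduled):
    """Build merge-colocation migrations, skipping any tablet that
    already has a migration scheduled.

    Emitting a second migration for the same tablet is dangerous:
    both would be written as mutations and the result could contain
    mixed cells from the two migrations.
    """
    plan = []
    for tablet, target in candidates:
        if tablet in scheduled:
            continue                 # the check this patch adds
        plan.append((tablet, target))
        scheduled.add(tablet)       # later plans must also skip it
    return plan
```

With `t1` already scheduled for migration, only `t2` gets a new colocation migration.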
Fixes scylladb/scylladb#27304
Closes scylladb/scylladb#27312
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.
The patch is
for f in $(git grep -l 'seastar::compat::source_location'); do
sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
done
and removal of few header includes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#27309
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.
The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
- Remove `seastar/util/modules.hh` includes as the file was removed
from seastar
- Modified `metrics::impl::labels_type` construction in
`test/boost/group0_test.cc` because now it requires `escaped_string`
* seastar 340e14a7...8c3fba7a (32):
> Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
dns: Optimize packet sending for newer c-ares versions
dns: Replace net::packet with vector<temporary_buffer>
dns: Remove unused local variable
dns: Remove pointless for () loop wrapping
dns: Introduce do_sendv_tcp() method
dns: Introduce do_send_udp() method
> test: Add http rules test of matching order
> Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
memcached: Patch test to use memory_data_source
memcached: Use memory_data_source in server
rpc: Use memory_data_sink without constructing net::packet
util: Generalize packet_data_source into memory_data_source
> tests: coroutines: restore "explicit this" tests
> reactor: remove blocking of SIGILL
> Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
github: Use gcc-14
github: Use clang-20
> Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
test: Make test_resolve() try several addresses
test: Coroutinize test_resolve() helper
> modules: make module support standards-compliant
> Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
dns: Squash two if blocks together
dns: Do not check tcp entry for udp type
> coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
> promise: Document that promise is resolved at most once
> coroutine: exception: workaround broken destroy coroutine handle in await_suspend
> socket: Return unspecified socket_address for unconnected socket
> smp: Fix exception safety of invoke_on_... internal copying
> Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
reactor: Keep loads timer on reactor
reactor: Update loads evaluation loop
> Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
test: extend tests/unit/api.json to use 'integer' type
scripts: add 'integer' type to seastar-json2code
> Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
tls: Rename do_put_one(temporary_buffer) into do_put()
tls: Fix indentation after previous patch
tls: Move semaphore grab into iterating do_put()
> net: tcp: change unsent queue from packets to temporary_buffer:s
> timer: Enable highres timer based on next timeout value
> rpc: Add a new constructor in closed_error to accept string argument
> memcache: Implement own data sink for responses
> Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
file: do_recursive_remove_directory(): move object when popping from queue
file: do_recursive_remove_directory(): adjust indentation
file: do_recursive_remove_directory(): coroutinize
file: do_recursive_remove_directory(): simplify conditional
file: do_recursive_remove_directory(): remove wrong const
file: do_recursive_remove_directory(): clean up work_entry
> tests: Move thread_context_switch_test into perf/
> test: Add unit test for append_challenged_posix_file
> Merge 'Prometheus metrics handler optimization' from Travis Downs
prometheus: optimize metrics aggregation
prometheus: move and test aggregate_by helper
prometheus: various optimizations
metrics: introduce escaped_string for label values
metric:value: implement + in terms of +=
tests: add prometheus text format acceptance tests
extract memory_data_sink.hh
metrics_perf: enhance metrics bench
> demos: Simplify udp_zero_copy_demo's way of preparing the packet
> metrics: Remove deprecated make_...-ers
> Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
test: Use BOOST_REQUIRE checkers
test: Replace some SEASTAR_ASSERT-s with static_assert-s
test: Convert slab test into boost kind
> Merge 'Coroutinize lister_test' from Pavel Emelyanov
test: Fix indentation after previuous patch
test: Coroutinize lister_test lister::report() method
test: Coroutinize lister_test main code
> file: recursive_remove_directory(): use a list instead of a deque
> Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
tls: Stop using net::packet in session::put()
tls: Fix indentation after previous patch
tls: Split session::do_put()
tls: Mark some session methods private
Closes scylladb/scylladb#27240
The PRUNE MATERIALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to perform a single-row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Fixes https://github.com/scylladb/scylladb/issues/27070
Commit 6e4803a750 broke the notification about expired erms held for too long, since it resets the tracker without calling its destructor (where the notification is triggered). Fix the assignment operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141
Closes scylladb/scylladb#27140
* https://github.com/scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for too long
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.
This change ensures tests using vector indexes run exclusively with tablets.
Fixes: VECTOR-49
Closes scylladb/scylladb#26843
This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.
These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest.
New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.
Backport is not required, since it is a new feature
Fixes #21776
Closes scylladb/scylladb#26702
* github.com:scylladb/scylladb:
file_stream_test: add sstable file streaming integrity verification test cases
streaming: prioritize sender-side errors in tablet_stream_files
sstables: enable integrity check for data file streaming
sstables: Add compressed raw streaming support
sstables: Allow to read digest and checksum from user provided file instance
sstables: add overload of data_stream() to accept custom file_input_stream_options
In ALTER KEYSPACE, when a datacenter name is omitted, its replication
factor is implicitly set to zero with vnodes, while with tablets,
it remains unchanged.
ALTER KEYSPACE should behave the same way for tablets as it does
for vnodes. However, the vnode behavior can be dangerous, as we may
mistakenly drop the whole datacenter.
Reject ALTER KEYSPACE if it changes replication factor, but omits
a datacenter that currently contains tablet replicas.
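The rejection rule can be sketched as computing the set of omitted DCs that still hold tablet replicas; this is a simplified model and the names are illustrative:

```python
def omitted_dcs_with_replicas(current_rf, new_rf, dcs_with_tablet_replicas):
    """Return the DCs an ALTER KEYSPACE must not omit.

    If the RF map changes and a DC that still holds tablet replicas is
    missing from the new map, the statement should be rejected instead
    of silently treating the omission as RF=0 (vnodes) or as
    'unchanged' (tablets). RF maps are {dc: rf} dicts here.
    """
    if new_rf == current_rf:
        return []                       # no RF change, nothing to check
    return sorted(set(dcs_with_tablet_replicas) - set(new_rf))
```

A non-empty result means the ALTER KEYSPACE is rejected.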
Fixes: https://github.com/scylladb/scylladb/issues/25549
Closes scylladb/scylladb#25731
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes #26865
Closes scylladb/scylladb#26941
* github.com:scylladb/scylladb:
address_map: Use barrier() to wait for replication
address_map: Use more efficient and reliable replication method
utils: Introduce helper for replicated data structures
Add 'test_sstable_stream' to verify SSTable file streaming integrity check.
The new tests cover both compressed and uncompressed SSTables and include:
- Checksum mismatch detection verification
- Digest mismatch detection verification
Fixes #26874
Due to certain people (me) not being able to tell forward from backward,
the data alignment to ensure partial uploads adhere to the 256k-align
rule would potentially _reorder_ trailing buffers generated iff the
source buffers input into the sink are small enough. Which, as a fun fact,
they are in backup upload.
Change the unit test to use raw sink IO, and add two unit tests (of which
the smaller size provokes the bug) that check the same 64k-buffer segmented
upload that backup uses.
Closes scylladb/scylladb#26938
According to the IETF spec, UUID variant bits should be set to '10'. All
others are either invalid or reserved. The patch changes the code to
follow the spec.
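The variant rule boils down to forcing the top two bits of octet 8 to binary '10'; a quick sketch (the helper name is illustrative):

```python
import uuid

def set_rfc4122_variant(b: bytearray) -> bytearray:
    """Force the UUID variant bits to binary '10' (RFC 4122).

    The variant lives in the two most significant bits of octet 8:
    clear both, then set the high bit.
    """
    b[8] = (b[8] & 0x3F) | 0x80
    return b

# All-zero bytes would decode with an invalid/NCS variant; after the
# fix-up, the standard library agrees the UUID is RFC 4122.
raw = bytearray(16)
u = uuid.UUID(bytes=bytes(set_rfc4122_variant(raw)))
```

Any other top-bit pattern in octet 8 ('0x', '110', '111') is the NCS, Microsoft, or reserved variant, which is what the patch stops emitting.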
Closes scylladb/scylladb#27073
This patch adds a metric for pre-compression size of sstable files.
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
As for the implementation:
Before this patch, tables and sstable sets already track their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.
New functionality, no backport needed.
Closes scylladb/scylladb#26996
* github.com:scylladb/scylladb:
replica/table: add a metric for hypothetical total file size without compression
replica/table: keep track of total pre-compression file size
More efficient than 100 pings.
There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes #26865
Fixes #26835
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designated owning
shard and replicated to other shards.
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.
Fix that, and add a test that catches this.
Fixes scylladb/scylladb#27082
Closes scylladb/scylladb#27083
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and the digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt the Digest or Data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required, it is an improvement
Refs #21776
Closes scylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
Adds test cases to verify that repair_reader correctly detects SSTable (both compressed and uncompressed) checksum mismatches.
Digest mismatch verification is not possible, as the repair reader may skip some sstable data, which automatically disables digest verification.
Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and clean up some duplication in the code around the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` into a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closes scylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatible cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint64_t bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
`topology_coordinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of a bad iterator comparison after a table and tablet search:
```cpp
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
There was also a second problem: `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.
This change fixes both problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.
A version containing this bug has not been released yet, so no backport is needed.
Closes scylladb/scylladb#26946
* github.com:scylladb/scylladb:
load_stats: add test for migrate_tablet_size()
load_stats: fix problem with tablet size migration
* Adds test fixture for AWS KMS
* Adds test fixture for Azure KMS
* Adds key provider proxy for Azure to pytests (ported dtests)
* Make test gather for boost tests handle suites
* Fix GCP test snafu
Fixes #26781
Fixes #26780
Fixes #26776
Fixes #26775
Closes scylladb/scylladb#26785
* github.com:scylladb/scylladb:
gcp_object_storage_test: Re-enable parallelism.
test::pylib: Add azure (mock) testing to EAR matrix
test::boost::encryption_at_rest: Remove redundant azure test indent
test::boost::encryption_at_rest: Move azure tests to use fixture
test::lib: Add azure mock/real server fixture
test::pylib::boost: Fix test gather to handle test suites
utils::gcp::object_storage: Fix typo in semaphore init
test::boost::encryption_at_rest_test: Remove redundant indent
test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
test::boost::test_encryption_at_rest: Reorder tests and helpers
ent::encryption: Make text helper routines take std::string
test::pylib::dockerized_service: Handle docker/podman bind error message
test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
test::lib::gcs_fixture: Only set port if running docker image + more retry
Those test cases use lister::scan_dir() to validate the contents of a table's snapshot directory against the table's base directory. This PR generalizes the listing code, making it shorter.
Also, the snapshot_skip_flush_works case is missing a check for the "schema.cql" file. Nothing is wrong without it, but the test is more accurate when it checks for that file.
Also, the snapshot_with_quarantine_works case tries to check whether one set of names is a subset of another using lengthy code. Using std::includes improves the test's readability a lot.
Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586)
Improving existing working test, no backport is needed.
Closes scylladb/scylladb#26693
* github.com:scylladb/scylladb:
database_test: Simplify snapshot_with_quarantine_works() test
database_test: Improve snapshot_skip_flush_works test
database_test: Simplify snapshot_works() tests
database_test: Use collect_files() to remove files
database_test: Use collect_files() to count files in directory
database_test: Introduce collect_files() helper
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments (those covering the range `[_last_key_mismatch, needed_prefix)`), trim the first and last of them appropriately, and feed them to the trie writer.
But there's an error in the current logic in the case where `_last_key_mismatch` falls on a fragment boundary. To describe it with an example: if the key is fragmented like `aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`, then the intended output to the trie writer is `bbb|c`, but the actual output is `|bbb|c` (i.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ at any position.
Fixes scylladb/scylladb#26819
Closes scylladb/scylladb#26839