- Move the initialization of log_done inside the try block to catch any
exceptions it may throw.
- Relocate the failure warning log after sink.close() cleanup
to guarantee sink.close() is always called before logging errors.
Refs #25497Closesscylladb/scylladb#25591
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.
This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes#22472Closesscylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.
This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes#22472
No need to start a local generator.
Can just use the table's sstable generation generator
to make new sstables now that it's stateless and doesn't
depend on the highest generation found.
Note that tablet_stream_files_handler used uuid generations
unconditionally from inception
(4018dc7f0d).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As requested in #22114, moved the files and fixed other includes and build system.
Moved files:
- interval.hh
- Map_difference.hh
Fixes: #22114
This is a cleanup, no need to backport
Closesscylladb/scylladb#25095
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes: #24807
token_range_vector is a sequence of intervals of tokens. It is used
to describe vnodes or token ranges owned by shards.
Since tokens are bloated (16 bytes instead of 8), and intervals are bloated
(40 byte of overhead instead of 8), and since we have plenty of token ranges,
such vectors can exceed our allocation unit of 128 kB and cause allocation stalls.
This series fixes that by first generalizing some helpers and then changing
token_range_vector to use chunked_vector.
Although this touches IDL, there is no compatibility problem since the encoding
for vector and chunked_vector are identical.
There is no performance concern since token_range_vector is never used on
any hot path (hot paths always contain a partition key).
Fixes#3335.
Fixes#24115.
No backport: minor performance fix that isn't a regression.
Closesscylladb/scylladb#24205
* https://github.com/scylladb/scylladb:
dht: fragment token_range_vector
partition_range_compat: generalize wrap/unwrap helpers
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.
Convert it to utils::chunked_vector, which prevents allocation
stalls.
It is not used in any hot path, as it usually describes
vnodes or similar things.
Fixes#3335.
* The new abort command explicitly represents the abortion flow in
mutation streaming, clearly identifying operations that are
intentionally aborted. This reduces ambiguity around failures in
streaming operations.
* In the error-handling section, aborted operations are now
explicitly marked as the cause of the streaming failure. This allows
us to differentiate them from genuine errors and appropriately adjust
log severity to reduce unnecessary alarm caused by aborted streaming
failures.
* To avoid alarming users with excessive error logs, log severity for
streaming failures caused by aborted operations has been downgraded.
This helps keep logs cleaner and prevents unnecessary concerns.
* A new feature has been added to ensure mixed clusters during updates
do not receive unsupported RPC messages, improving compatibility and
stability.
fixes: https://github.com/scylladb/scylladb/issues/23076Closesscylladb/scylladb#23214
Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.
Code cleanup, no backport required.
Closesscylladb/scylladb#24087
* github.com:scylladb/scylladb:
test/boost/mutation_reader_another_test: drop v2 from reader and related names
test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
test/boost/mutation_test: s/consumer_v2/consumer/
test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
readers/mutation_readers: s/generating_reader_v2/generating_reader/
readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
readers/mutation_source: s/make_reader_v2/make_mutation_reader/
readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
mutation/mutation_compactor: drop v2 from compactor and related names
replica/table: s/make_reader_v2/make_mutation_reader/
mutation_writer: s/bucket_writer_v2/bucket_writer/
readers/queue: drop v2 from reader and related names
readers/multishard: drop v2 from reader and related names
readers/evictable: drop v2 from reader and related names
readers/multi_range: remove flat from name
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.
Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.
Fixes: #15257.
Closesscylladb/scylladb#23915
The method does several excessive things, that can be relaxed
1. In order to transfer a table-id to another shard, finds the table on
source shard, gets schema and captures schema id on invoke_on()'s
lambda. It can just capture the original table-id
2. In order to get sstable parameters (format, version, etc.) generates
toc_filename(), then calls parse_path() to convert it into the
entry_descriptor. The descriptor can be read from sstable directly.
3. Logging "success" includes target shard into the message, but happens
on the source shard. The message can be just logged on target shard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23197
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
Needs backport to all live version, as they all contain the bug
Closesscylladb/scylladb#22868
* github.com:scylladb/scylladb:
streaming: fix the way a reason of streaming failure is determined
streaming: save a continuation lambda
streaming: use streaming namespace in table_check.{cc,hh}
repair: streaming: move table_check.{cc,hh} to streaming
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.
However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.
Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```
In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.
Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.
Fixes: https://github.com/scylladb/scylladb/issues/22834.
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.
Assign a lambda to a variable to prolong the captures lifetime.
Replace value-based exception catching with reference-based catching to address
GCC warnings about polymorphic type slicing:
```
warning: catching polymorphic type ‘class seastar::rpc::stream_closed’ by value [-Wcatch-value=]
```
When catching polymorphic exceptions by value, the C++ runtime copies the
thrown exception into a new instance of the specified type, slicing the
actual exception and potentially losing important information. This change
ensures all polymorphic exceptions are caught by reference to preserve the
complete exception state.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23064
Both, repair and streaming depend on view builder, but since the builder is started too late, both keep sharded<> reference on it and apply `if (view_builder.local_is_initialized())` safety checks.
However, view builder can do its sharded start much earlier, there's currently nothing that prevents it from doing so. This PR moves view builder start up together with some other of its dependencies, and relaxes the way repair and streaming use their view-builder references, in particular -- removes those ugly initialization checks.
refs: scylladb/scylladb#2737Closesscylladb/scylladb#22676
* github.com:scylladb/scylladb:
streaming: Relax streaming::make_streamig_consumer() view builder arg
streaming: Keep non-sharded view_builder dependency reference
streaming: Remove view_builder.local_is_initialized() checks
repair: Keep non-sharded view_builder dependency reference
repair: Remove view_builder.local_is_initialized() checks
main: Start sharded<view_builder> earlier
test/cql_env: Move stream manager start lower
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22857
Two callers of it -- repair and stream-manager -- both have non-sharded
reference and can just use it as argument. The helper in question gets
sharded<> one by itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous path -- view builder is started early
enough and construction of stream manager can happen with non-sharded
reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The following is observed in pytest:
1) node1, stream master, tried to pull data from node3
2) node3, stream follower, found node1 restarted
3) node3 killed the rpc stream
4) node1 did not get the stream session failure message from node3. This
failure message was supposed to kill the stream plan on node1. That's the
reason node1 failed the stream session much later at "2024-08-19 21:07:45,539".
Note, node3 failed the stream on its side, so it should have sent the stream
session failure message.
```
$ cat node1.log |grep f890bea0-5e68-11ef-99ae-e5bca04385fc
INFO 2024-08-19 20:24:01,162 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers={127.0.34.3}, master
ERROR 2024-08-19 20:24:01,190 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=127.0.34.3: seastar::nested_exception: seastar::rpc::stream_closed (rpc stream was closed by peer) (while cleaning up after seastar::rpc::stream_closed (rpc stream was closed by peer))
WARN 2024-08-19 21:07:45,539 [shard 0:main] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.3}, tx=0 KiB, 0.00 KiB/s, rx=484 KiB, 0.18 KiB/s
$ cat node3.log |grep f890bea0-5e68-11ef-99ae-e5bca04385fc
INFO 2024-08-19 20:24:01,163 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers=127.0.34.1, slave
INFO 2024-08-19 20:24:01,164 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Start sending ks=ks, cf=cf, estimated_partitions=2560, with new rpc streaming
WARN 2024-08-19 20:24:01,187 [shard 0: gms] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.1}, tx=633 KiB, 26506.81 KiB/s, rx=0 KiB, 0.00 KiB/s
WARN 2024-08-19 20:24:01,188 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] stream_transfer_task: Fail to send to 127.0.34.1:0: seastar::rpc::stream_closed (rpc stream was closed by peer)
WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to send: seastar::rpc::stream_closed (rpc stream was closed by peer)
WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming error occurred, peer=127.0.34.1
```
To be safe in case the stream fail message is not received, node1 could fail
the stream plan as soon as the rpc stream is aborted in the
stream_mutation_fragments handler.
Fixes#20227Closesscylladb/scylladb#21960
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22467
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool.
This set implements the missing bits and adds tests for new functionality.
Closesscylladb/scylladb#21877
* github.com:scylladb/scylladb:
nodetool: Implement [gs]etstreamthroughput commands
nodetool: Implement [gs]etcompationthroughput commands
test: Add validation of how IO-updating endpoints work
api: Implement /storage_service/(stream|compaction)_throughput endpoints
api: Disqualify const config reference
api: Implement /storage_service/stream_throughput endpoint
api: Move stream throughput set/get endpoints from storage service block
api: Move set_compaction_throughput_mb_per_sec to config block
util: Include fmt/ranges.h in config_file.hh
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::any_of`.
in this change, we replace `boost::algorithm::any_of` with
`std::ranges::any_of`
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `reader_consumer_v2` type
(`std::function<future<> (mutation_reader)>`) is defined alongside
`mutation_reader` in `mutation_reader.hh`.
before this change, we sometimes use
`std::function<future<> (mutation_reader)>` directly when defining a
consumer parameter or a consumer variable.
in this change, we improve maintainability by:
- Reducing duplicate function type declarations
- Centralizing the consumer type definition
- Making future signature updates easier to implement
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21369
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.
Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.
Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264Closesscylladb/scylladb#21059
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.
in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006