Commit Graph

3696 Commits

Author SHA1 Message Date
Amnon Heiman
94ba8af788 sstables.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_sstables_cell_tombstone_writes
scylla_sstables_range_tombstone_reads
scylla_sstables_range_tombstone_writes
scylla_sstables_row_tombstone_reads
scylla_sstables_tombstone_writes
2025-03-03 16:58:38 +02:00
Kefu Chai
a3ac7c3d33 remove redundant std::move() from position_in_partition::key()
Fix GCC warning about moving from a const reference in mp_row_consumer_k_l::flush_if_needed.
Since position_in_partition::key() returns a const reference, std::move has no effect.

Considered adding an rvalue reference overload (clustering_key_prefix&& key() &&) but
since the "me" sstable format is mandatory since 63b266e9, this approach offers no benefit.

This change simply removes the redundant std::move() call to silence the warning and
improve code clarity.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23085
2025-03-03 12:51:40 +03:00
Kefu Chai
9e0e99347f sstables: explicitly call parent's default constructor in copy constructor
When implementing the copy constructor for `sstable_set` (derived from
`enable_lw_shared_from_this`), we intentionally need the parent's default
constructor rather than its copy constructor. This is because each new
`sstable_set` instance maintains its own reference count and owns a clone
of the source object's implementation (`x._impl->clone()`).

Although this behavior is correct, GCC warns about not calling the parent's
copy constructor. This change explicitly calls the parent's default constructor
to:

1. Silence GCC warnings
2. Clearly document our intention to use the default constructor
3. Follow best practices for constructor initialization

The functionality remains unchanged, but the code is now more explicit about
its design and free of compiler warnings.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23083
2025-02-28 13:52:24 +03:00
Pavel Emelyanov
eff61b167c treewide: Reduce db/config.hh header fanout
Drop it from files that obviously don't need it. Also kill some forward
declarations while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22979
2025-02-25 15:16:40 +01:00
Kefu Chai
9fdbe0e74b tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22997
2025-02-25 10:32:32 +03:00
Botond Dénes
6ab16006a2 Merge 'Untangle sstable-directory vs sstable in pending log creation code' from Pavel Emelyanov
There's a sstable_directory::create_pending_deletion_log() helper method that's called by sstable's filesystem_storage atomic-delete methods and that prepares the deletion log for a bunch of sstables. For that method to do its job it needs to get private sstable->_storage field (which is always the filesystem_storage one), also the atomic-delete transparent context object is leaked into the sstable_directory code and low-level sstable storage code needs to include higher-level sstable_directory header.

This patch unties these knots. As the result:
- friendship between sstable and sstable_directory is removed
- transparent atomic_delete_context is encapsulated in storage.(cc|hh) code
- less code for create_pending_deletion_log() to dump TOC filename into log

Closes scylladb/scylladb#22823

* github.com:scylladb/scylladb:
  sstable: Unfriend sstable_directory class
  sstable_directory: Move sstable_directory::pending_delete_result
  sstable_directory: Calculate prefixes outside of create_pending_deletion_log()
  sstable_directory: Introduce local pending_delete_log variable
  sstable_directory: Relax toc file dumping to deletion log
2025-02-24 14:58:37 +02:00
Avi Kivity
1891e10b7b sstables: writer.hh: drop unneeded boost depedencies
Closes scylladb/scylladb#22955
2025-02-24 13:26:44 +03:00
Pavel Emelyanov
a6c882e4e3 sstables: Remove dead get_config() and db::config declarations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22974
2025-02-21 15:56:04 +01:00
Pavel Emelyanov
d79eec2e76 sstable: Unfriend sstable_directory class
It was only needed there for create_pending_deletion_log() method to get
private "_storage" from sstable. Now it's all gone and friendship can be
broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
96a867c869 sstable_directory: Move sstable_directory::pending_delete_result
... to where it belongs -- to the filesystem storage driver itself.
Continuation of the previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
f6de6d6887 sstable_directory: Calculate prefixes outside of create_pending_deletion_log()
The method in question walks the list of sstables and accumulates
sstables' prefixes into a set on pending_delete_result object. The set
in question is not used at all in this method and is in fact alien to it
-- the p.d._result object is used by the filesystem storage driver as
atomic deletion prepare/commit transparent context.

Said that, move the whole pending_delete_result to where it belongs and
relax the create_pending_deletion_log() to only return the log
directory path string.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
b0c1a77528 sstable_directory: Introduce local pending_delete_log variable
This is simply to reduce the churn in the next patch, nothing special
here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Pavel Emelyanov
5b92c4549e sstable_directory: Relax toc file dumping to deletion log
The current code takes sstable prefix() (e.g. the /foo/bar string), then
trims from its fron the basedir (e.g. the /foo/ string) and then writes
the remainder, a slash and TOC component name (e.g. the xxx-TOC.txt
string). The final result is "bar/xxx-TOC.txt" string.

The taking into account sstable.toc_filename() renders into
sstable.prefix + \slash + component-name, the above result can be
achieved by trimming basedir directory from toc_filename().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-02-19 13:09:04 +03:00
Avi Kivity
30a38e61d4 Merge 'sstables_manager: trigger reclaim/reload on components_memory_reclaim_threshold update' from Lakshmi Narayanan Sreethar
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.

This PR first refactors the reclaim and reload logic into a single
background fiber. It then updates the sstable manager to subscribe to
changes in the `components_memory_reclaim_threshold` configuration value
and immediately triggers the reclaim/reload fiber when a change is
detected.

Fixes #21947

This is an improvement and does not need to be backported.

Closes scylladb/scylladb#22725

* github.com:scylladb/scylladb:
  sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update
  sstables_manager: maybe_reclaim_components: yield between iterations
  sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()`
  sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()`
  sstables_manager: rename `_sstable_deleted_event` condition variable
  sstables_manager: rename `components_reloader_fiber()`
  sstables_manager: fix `maybe_reclaim_components()` indentation
  sstables_manager: reclaim components memory until usage falls below threshold
  sstables_manager: introduce `get_components_memory_reclaim_threshold()`
  sstables_manager: extract `maybe_reclaim_components()`
  sstables_manager: fix `maybe_reload_components()` indentation
  sstables_manager: extract out `maybe_reload_components()`
2025-02-17 22:33:33 +02:00
Lakshmi Narayanan Sreethar
064bf2fd85 sstables_manager: trigger reclaim/reload on components_memory_reclaim_threshold update
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.

This patch updates the sstable manager to subscribe to any changes in
the above mentioned config value and immediately trigger the
reclaim/reload fiber when a change occurs. Also, adds a testcase to
verify the fix.

Fixes #21947

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-17 20:55:45 +05:30
Kefu Chai
7ff0d7ba98 tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22857
2025-02-15 20:32:22 +02:00
Lakshmi Narayanan Sreethar
10fffcd646 sstables_manager: maybe_reclaim_components: yield between iterations
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
77107ddaa3 sstables_manager: rename increment_total_reclaimable_memory_and_maybe_reclaim()
Renamed the aboved mentioned method to `increment_total_reclaimable_memory()`
as it doesn't directly reclaim memory anymore.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
7f0f839d6d sstables_manager: move reclaim logic into components_reclaim_reload_fiber()
Move the sstable reclaim logic into `components_reclaim_reload_fiber()`
in preparation for the fix for #21947. This also simplifies the overall
reclaim/reload logic by preventing multiple fibers from attempting to
reclaim/reload component memory concurrently.

Also, update the existing test cases to adapt to this change.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
f73b6abcc7 sstables_manager: rename _sstable_deleted_event condition variable
Rename the `_sstable_deleted_event` condition variable to
`_components_memory_change_event` as it will be used by future patches
to signal changes in sstable component memory consumption, (i.e.)
during sstable create and delete, and also when the
`components_memory_reclaim_threshold` config value is changed.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
35a4de3eeb sstables_manager: rename components_reloader_fiber()
A future patch will move components reclaim logic into the current
`components_reloader_fiber()`, so to reflect its new purpose, rename it
to `components_reclaim_reload_fiber()`.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
4d396b9578 sstables_manager: fix maybe_reclaim_components() indentation
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
f53fd40ff0 sstables_manager: reclaim components memory until usage falls below threshold
The current implementation reclaims memory from SSTables only when a new
SSTable is created. An upcoming patch will move this reclaim logic into
the existing component reloader fiber. To support this change, the
`maybe_reclaim_components()` method is updated to reclaim memory until
the total memory consumption falls below the configured threshold.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
30184ead79 sstables_manager: introduce get_components_memory_reclaim_threshold()
Introduce `get_components_memory_reclaim_threshold()`, which returns
the components' memory threshold based on the total available memory.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
4d12ae433a sstables_manager: extract maybe_reclaim_components()
Extract the code from
`increment_total_reclaimable_memory_and_maybe_reclaim()` that reclaims
the components memory into `maybe_reclaim_components()`. The extracted
new method will be used by a following patch to handle reclaim within
the components reload fiber.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar
59cbee6fc7 sstables_manager: fix maybe_reload_components() indentation
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:03 +05:30
Lakshmi Narayanan Sreethar
ce2aa15d19 sstables_manager: extract out maybe_reload_components()
Extract the logic that reloads reclaimed components into memory in the
`components_reloader_fiber()` method into a separate method. This is in
preparation for moving the reclaim logic into the same fiber.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-02-14 22:11:03 +05:30
Avi Kivity
5adaf0a605 Merge 'tree: migrate from boost::remove_if() to the standard library based alternatives' from Kefu Chai
Replace boost::remove_if() with the standard library's std::erase_if() or std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#22788

* github.com:scylladb/scylladb:
  service: migrate from boost::range::remove_if() to std::ranges::remove_if
  sstable: migrate from boost::remove_if() to std::erase_if()
2025-02-11 14:07:48 +02:00
Kefu Chai
481397317d sstables, test: migrate from boost::copy() to std::ranges::copy()
Replace boost::copy() with the standard library's std::ranges::copy()
to reduce external dependencies and simplify the codebase. This change
eliminates the requirement for boost::range and makes the implementation
more maintainable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22789
2025-02-11 14:55:25 +03:00
Kefu Chai
ba724a26f4 sstable: migrate from boost::remove_if() to std::erase_if()
Replace boost::remove_if() with the standard library's std::erase_if()
to reduce external dependencies and simplify the codebase. This change
eliminates the requirement for boost::range and makes the implementation
more maintainable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-02-11 09:15:14 +08:00
Kefu Chai
c6bf9d8d11 sstables: switch from boost to std::ranges::all_of()
Replace boost::algorithm::all_of_equal() to std::ranges::all_of()

In order to reduce the header dependency to boost ranges library, let's
use the utility from the standard library when appropriate.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22730
2025-02-10 15:44:55 +03:00
Ernest Zaslavsky
d534051bea aws creds: add env. and file credentials providers
This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.
2025-02-05 14:57:19 +02:00
Pavel Emelyanov
4aff86ac64 sstables: Mark sstable::sstable_buffer_size const
It really never changes once set in constructor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22533
2025-01-29 19:51:22 +02:00
Avi Kivity
c71f383cf2 Merge 'sstables: use std::variant instead of boost::variant' from Botond Dénes
Continue replacing boost types with std one where possible.

Improvement, no backport needed

Closes scylladb/scylladb#22290

* github.com:scylladb/scylladb:
  sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant
  scylla-gdb.py: std_variant: fix get()
  sstables: disk_types: remove unused disk_tagged_union
2025-01-29 15:29:15 +02:00
Botond Dénes
b70dccb638 sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant
In the spirit of using standard-library types, instead of boost ones
where possible.
Although a disk type, it is serialized/deserialized with custom code, so
the change shouldn't cause any changes in the disk representation.
2025-01-27 09:29:26 -05:00
Botond Dénes
08f1aecc1e sstables: disk_types: remove unused disk_tagged_union 2025-01-27 09:29:26 -05:00
Avi Kivity
a23a3110b5 utils: config_file: forward_declare boost::program_options classes
Avoid pulling in boost dependencies when all we need is the class name.

Closes scylladb/scylladb#22453
2025-01-27 10:45:43 +03:00
Kefu Chai
0237913337 sstables: Migrate from boost::adaptors::indexed to std::views::enumerate
This change modernizes the codebase by:
- Replacing Boost's indexed adaptor with C++20's std::views::enumerate
- Removing unnecessary Boost header inclusion

With this change, we can:
- Reduce external dependencies
- Leverage standard library features
- Improve long-term code maintainability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22469
2025-01-26 20:51:14 +02:00
Kefu Chai
4a268362b9 compress: fix compressor initialization order by making namespace_prefix a function
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.

Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.

Fixes scylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22451
2025-01-26 13:43:02 +02:00
Asias He
4018dc7f0d Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22467
2025-01-26 12:51:59 +02:00
Avi Kivity
59d3a66d18 Revert "Introduce file stream for tablet"
This reverts commit 8208688178. It was
contributed from enterprise, but is too different from the original
for me to merge back.
2025-01-22 09:42:20 +02:00
Benny Halevy
88ae067ddb everywhere: add skeletal support for the in_memory_tables feature
Forward-ported from scylla-enterprise.
Note that the feature has been deprecated and the implementation
is provided only for backward compatibility with pre-existing
features and schema.

Tested manually after adding the following to feature_service:
```
    gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
```

Launched a single-node cluster running 2023.1.10
```
cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'};
```

log:
```
Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ...
...
INFO  2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440
```

Upgraded to this branch and started scylla.
Verified that ks.test was successfuly loaded:

log:
```
INFO  2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ...
...
WARN  2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy.
...
INFO  2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8
```

Then, tested:
```
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'InMemoryCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'};
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
```

log:
```
INFO  2024-12-27 19:56:40,465 [shard 0:stmt] migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22068
2025-01-20 16:55:17 +02:00
Asias He
8208688178 Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22034
2025-01-20 16:43:21 +02:00
Pavel Emelyanov
14c3fbbf8c Merge 'sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan Sreethar
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

This fixes a regression and should be backported.

Closes scylladb/scylladb#22263

* github.com:scylladb/scylladb:
  sstable_directory: do not load remote sstables in process_descriptor
  sstable_directory: update `load_sstable()` definition
  sstable_directory: reintroduce `get_shards_for_this_sstable()`
2025-01-17 11:17:54 +03:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Lakshmi Narayanan Sreethar
63100b34da sstable_directory: do not load remote sstables in process_descriptor
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-13 20:01:30 +05:30
Lakshmi Narayanan Sreethar
6e3ecc70a6 sstable_directory: update load_sstable() definition
Updated `sstable_directory::load_sstable()` to directly accept
`data_dictionary::storage_options` instead of a function that returns
the same. This is required to ensure `process_descriptor()` loads the
sstable only once in the right shard.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-13 20:00:29 +05:30
Lakshmi Narayanan Sreethar
d2ba45a01f sstable_directory: reintroduce get_shards_for_this_sstable()
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-01-10 23:32:58 +05:30
Calle Wilund
9f06a0e3a3 sstables: add get_shared_components accessor
To access the shared components.
2025-01-08 12:50:03 +00:00
Benny Halevy
3e22998dc1 sstables: parse(summary): reserve positions vector
We know the number of positions in advance
so reserve the chunked_vector capacity for that.

Note: reservation replaces the existing reset of the
positions member.  This is safe since we parse the summary
only once as sstable::read_summary() returns early
if the summary component is already populated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21767
2024-12-26 13:33:29 +02:00