Rather than tracking only the replica host_id, keep
track of the locator:::node& to prepare for
rack-aware pairing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Simplify the function logic by calculating the predicate
function once, before scanning all base and view replicas,
rather than testing the different options in the inner loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"self-pairing" is enabled only when use_legacy_self_pairing
is enabled. That is currently unclear in the documentation
comment for this function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently we always lookup both `my_address` and *target_endpoint
in remote_endpoints. But if my_address is in remote_endpoints
in some cases the second lookup is not needed, so do it only
to decide whether to swap target_endpoint with my_address, if
found in remote_endpoints, or to remove that match, if
*target_endpoint is already pending as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although at the moment storage_service::replicate_to_all_cores
may yield between updating the base and view tables with
a new effective_replication_map, scylladb/scylladb#21781
was submitted to change that so that they are updated
atomically together.
This change prepares for the above change, and is harmless
at the moment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All view live in the same keyspace as their base
table, so calculate the keyspace-dependent flags
once, outside the per-view update loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, during backup, SSTable components are preserved in the
snapshot directory even after being uploaded. This leads to redundant
uploads in case of failed backups or restarts, wasting time and
resources (S3 API calls).
This change
- adds an optional query parameter named "move_files" to
"/storage_service/backup" API. if it is set to "true", SSTable
components are removed once they are backed up to object storage.
- conditionally removes SSTable components from the snapshot directory once
they are successfully uploaded to the target location. This prevents
re-uploading the same files and reduces disk usage.
This change only "Refs" #20655, because, we can move further optimize
the backup process, consider:
- Sending HEAD requests to S3 to check for existing files before uploading.
- Implementing support for resuming partially uploaded files.
Fixes#21799
Refs #20655
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Extract upload_component() from backup_task_impl::do_backup() to improve
readability and prepare for optional post-upload cleanup. This refactoring
simplifies the main backup flow by isolating the upload logic into its own
function.
The change is motivated by an upcoming feature that will allow optional
deletion of components after successful upload, which would otherwise add
complexity to do_backup().
Refs scylladb/scylladb#21799
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of implementing `snapshot_ctl::run_snapshot_modify_operation()` as
a template function, let it accept as plain noncopyable_function
instance, and return `future<>`.
Previously, `snapshot_ctl::run_snapshot_modify_operation` was a template
function that accepted a templated functor parameter. This approach
limited its usability because callers needed to be defined in the same
translation unit as the template implementation.
however, `backup_task_impl` is defined in another translation unit,
and we intend to call `snapshot_ctl::run_snapshot_modify_operation()`
in its implementation. so in order to cater this need, there are two
options:
1. to move the definition of the template function into the header file.
but the downside is that this slows down the compilation by increaing
the size of header.
2. to change the template function to a regular function. This change
restricts the function's parameter to a specific signature. However, all
current callers already return a `future<>` object, so there's minimal
impact.
in this change, we implement the second option. this allows us to call
this function from another translation unit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage`, set to
`false` by default, that require users to opt-in
in order to create new tables WITH COMPACT STORAGE.
Refs scylladb/scylladb#12263, scylladb/scylladb#16375
* Since this guardrail is an enhancement, no backport is needed
Closesscylladb/scylladb#16403
* github.com:scylladb/scylladb:
docs: ddl: document the deprecation of compact tables
test: enable_create_table_with_compact_storage for tests that need it
config: add enable_create_table_with_compact_storage
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.
The following patches are imported from the scylla enterprise:
*) Merge 'Introduce file stream for tablet' from Asias He
This patch uses Seastar RPC stream interface to stream sstable files on
network for tablet migration.
It streams sstables instead of mutation fragments. The file based
stream has multiple advantages over the mutation streaming.
- No serialization or deserialization for mutation fragments
- No need to read and process each mutation fragments
- On wire data is more compact and smaller
In the test below, a significant speed up is observed.
Two nodes, 1 shard per node, 1 initial_tablets:
- Start node 1
- Insert 10M rows of data with c-s
- Bootstrap node 2
Node 1 will migration data to node2 with the file stream.
Test results:
1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s
[shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds
2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s
[shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
[shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds
Test Summary:
File stream v.s. Mutation stream improvements
- Stream bandwidth = 836 / 128 (MB/s) = 6.53X
- Stream time = 23.60 / 1.08 (Seconds) = 21.85X
- Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X
Closes scylladb/scylla-enterprise#3438
* github.com:scylladb/scylla-enterprise:
tests: Add file_stream_test
streaming: Implement file stream for tablet
*) streaming: Use new take_storage_snapshot interface
The new take_storage_snapshot returns a file object instead of a file
name. This allows the file stream sender to read from the file even if
the file is deleted by compaction.
Closes scylladb/scylla-enterprise#3728
*) streaming: Protect unsupported file types for file stream
Currently, we assume the file streamed over the stream_blob rpc verb is
a sstable file. This patch rejects the unsupported file types on the
receiver side. This allows us to stream more file types later using the
current file stream infrastructure without worrying about old nodes
processing the new file types in the wrong way.
- The file_ops::noop is renamed to file_ops::stream_sstables to be
explicit about the file types
- A missing test_file_stream_error_injection is added to the idl
Fixes: #3846
Tests: test_unsupported_file_ops
Closesscylladb/scylla-enterprise#3847
*) idl: Add service::session_id id to idl
It will be used in the next patch.
Refs #3907
*) streaming: Protect file stream with topology_guard
Similar to "storage_service, tablets: Use session to guard tablet
streaming", this patch protects file stream with topology_guard.
Fixes#3907
*) streaming: Take service topology_guard under the try block
Taking the service::topology_guard could throw. Currently, it throws
outside the try block, so the rpc sink will not be closed, causing the
following assertion:
```
scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
seastar::rpc::sink_impl<netw::serializer,
streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
`this->_con->get()->sink_closed()' failed.
```
To fix, move more code including the topology_guard taking code to the
try block.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4106Closesscylladb/scylla-enterprise#4110
*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho
We're not preserving the SSTable state across file based migration, so
staging SSTables for example are being placed into main directory, and
consequently, we're mixing staging and non-staging data, losing the
ability to continue from where the old replica left off.
It's expected that the view update backlog is transferred from old
into new replica, as migration doesn't wait for leaving replica to
complete view update work (which can take long). Elasticity is preferred.
So this fix guarantees that the state of the SSTable will be preserved
by propagating it in form of subdirectory (each subdirectory is
statically mapped with a particular state).
The staging sstables aren't being registered into view update generator
yet, as that's supposed to be fixed in OSS (more details can be found
at https://github.com/scylladb/scylladb/issues/19149).
Fixes#4265.
Closesscylladb/scylla-enterprise#4267
* github.com:scylladb/scylla-enterprise:
tablet: Preserve original SSTable state with file based tablet migration
sstables: Add get method for sstable state
*) sstable: (Re-)add shareabled_components getter
*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund
Fixes#4246
Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.
Ensures we transfer and pre-process scylla metadata for streamed
file blobs first, then properly apply receiving nodes local config
by using a source and sink layer exported from sstables, which
handles things like ordering, metadata filtering (on source) as well
as handling metadata and proper IO paths when writing data on
receiver node (sink).
This implementation maintains the statelessness of the current
design, and the delegated sink side will re-read and re-write the
metadata for each component processed. This is a little wasteful,
but the meta is small, and it is less error prone than trying to do
caching cross-shards etc. The transport is isolated from the
knowledge.
This is an alternative/complement to #4436 and #4472, fixing the
underlying issue. Note that while the layers/API:s here allows easy
fixing of other fundamental problems in the feature (such as
destination location etc), these are not included in the PR, to keep
it as close to the current behaviour as possible.
Closesscylladb/scylla-enterprise#4646
* github.com:scylladb/scylla-enterprise:
raft_tests: Copy/add a topology test with encryption
file streaming: Use sstable source/sink to transfer snapshots
sstables: Add source and sink objects + producers for transfering a snapshot
sstable::types: Add remove accessor for extension info in metadata
*) The change for error injection in merge commit 966ea5955dd8760:
File streaming now has "stream_mutation_fragments" error injection points
so test_table_dropped_during_streaming works with file streaming.
*) doc: document file-based streaming
This commit adds a description of the file-based streaming feature to the documentation.
It will be displayed in the docs using the scylladb_include_flag directive after
https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
and, in turn, branch-2024.2.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
Refs https://github.com/scylladb/scylla-enterprise/issues/4254Closesscylladb/scylla-enterprise#4587
*) doc: move File-based streaming to the Tablets source file-based-streaming
This commit moves the description of file-based streaming from a common include file
to the regular doc source file where tablets are described.
Closesscylladb/scylla-enterprise#4652
*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference
Closesscylladb/scylladb#22034
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.
Yet, there's is nothing in the code that prevents users
from creating such tables.
This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage` that require users
to opt-in in order to create new tables WITH COMPACT STORAGE.
The option is currently set to `true` by default in db/config
to reduce the churn to tests and to `false` in scylla.yaml,
for new clusters.
TODO: once regressions tests that use compact storage
are converted to enable the option, change the default in
db/config to false.
A unit test was added to test/cql-pytest that
checks that the respective cql query fails as expected
with the default option or when it is explicitly set to `false`,
and that the query succeeds when the option is set to `true`.
Note that `check_restricted_table_properties` already
returns an optional warning, but it is only logged
but not returned in the `prepared_statement`.
Fixing that is out of the scope of this patch.
See https://github.com/scylladb/scylladb/issues/20945
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:
Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management
Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup
I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.
Notably, there should be no non-whitespace changes between this and scylla-enterprise
Fixesscylladb/scylla-enterprise#4999Closesscylladb/scylladb#22147
* github.com:scylladb/scylladb:
audit: Add shares support to service level management
audit: Add service level support to CQL login process
audit: Add support to CQL statements
audit: Integrate audit subsystem into Scylla main process
audit: Add documentation for the audit subsystem
audit: Add the audit subsystem
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.
Fixes: scylladb/scylla#21777
* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
hint manager: do not translate ip to id in case hint manager is stopped already
locator: token_metadata: drop update_host_id() function that does nothing now
locator: topology: drop indexing by ips
repair: drop unneeded code
storage_service: use host_id to look for a node in on_alive handler
storage_proxy: translate ips to ids in forward array using gossiper
locator: topology: remove unused functions
storage_service: check for outdated ip in on_change notification in the peers table
storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
topology coordinator: change connection dropping code to work on host ids
cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
locator: drop unused function from tablet_effective_replication_map
api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
locator: token_metadata: remove unused ip based functions
locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
gossiper: drop get_unreachable_token_owners functions
storage_service: use gossiper to map ip to id in node_ops operations
storage_service: fix indentation after the last patch
storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
token_metadata: drop no longer used functions
...
Declaring-but-not-defining a fully specialized template is a great way to
cut dependencies between users and providers, but unfortunately not
supported for variable templates. Clang 18 does support it, but
apparently it is a misinterpretation of the standard, and was removed
in clang 19.
We started using this non-feature in 7ed89266b3.
The fix is to use function templates. This is more verbose as
each specialization needs to define a static variable to return,
but is fully supported.
Closesscylladb/scylladb#22299
host_id_or_endpoint is a helper class that hold either id or ip and
translate one into another on demand. Use gossiper to do a translation
there instead of token_metadata since we want to drop ip based APIs from
the later.
One of its caller is in the RESTful API which gets ips from the user, so
we convert ips to ids inside the API handler using gossiper before
calling the function. We need to deprecate ip based API and move to host
id based.
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixes#20638
The PR contains few additional small fixes in separate commits related to the view build status table.
It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it timeouts.
For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.
backport to fix the bugs that affects previous versions and improve CI stability
Closesscylladb/scylladb#22307
* github.com:scylladb/scylladb:
view_builder: hold semaphore during entire startup
view_builder: pass view name by value to write_view_build_status
view_builder: write status to tables before starting to build
This change introduces a new audit subsystem that allows tracking and logging of database operations for security and compliance purposes. Key features include:
- Configurable audit logging to either syslog or a dedicated system table (audit.audit_log)
- Selective auditing based on:
- Operation categories (QUERY, DML, DDL, DCL, AUTH, ADMIN)
- Specific keyspaces
- Specific tables
- New configuration options:
- audit: Controls audit destination (none/syslog/table)
- audit_categories: Comma-separated list of operation categories to audit
- audit_tables: Specific tables to audit
- audit_keyspaces: Specific keyspaces to audit
- audit_unix_socket_path: Path for syslog socket
- audit_syslog_write_buffer_size: Buffer size for syslog writes
The audit logs capture details including:
- Operation timestamp
- Node and client IP addresses
- Operation category and query
- Username
- Success/failure status
- Affected keyspace and table names
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.
The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixesscylladb/scylladb#20638
Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631
EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data.
Introduces OpenSSL based file level encrypted storage, managed via a set of providers
ranging from local files to cloud KMS providers.
For a more comprehensive explanation, see the included docs (or if possible, original
source tree).
Manual bulk merge of EAR feature from enterprise repo to main scylla repo.
Breaks some features apart, but main EAR is still a humongous commit, because to separate this
I would have to mess with code incrementally, adding time and risk.
This PR includes the local file gen tool, tests and also p11 validation.
Note: CI will not execute the full tests unless master CI is set to provide the same environment
as the enterprise one. Not sure about the status of this ATM.
Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to
check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality
will be enabled, otherwise not.
Closesscylladb/scylladb#22233
* github.com:scylladb/scylladb:
docs: Add EAR docs
main/build: Add p11-kit and initialize
tools: Add local-file-key-generator tool
tests: Add EAR tests
tmpdir: shorten test tempdir path
EAR: port the ear feature from enterprise
cql_test_env: Add optional query timeout
schema/migration_manager: Add schema validate
sstables: add get_shared_components accessor
config/config_file: Add exports and definitions of config_type_for<>
Instantiated only on shard 0.
Currently, only subscribe from unit test
Manual unit test using loop mount was added.
Note that the test requires sudo access
and root access to /dev/loop, so it cannot
run in rootless podman instance, and it'd
fail with Permission denied.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21523
This PR extends authentication with 2 mechanisms:
- a new role_manager subclass, which allows managing users via
LDAP server,
- a new authenticator, which delegates plaintext authentication
to a running saslauthd daemon.
The features have been ported from the enterprise repository
with their test.py tests and the documentation as part of
changing license to source available.
Fixes: scylladb/scylla-enterprise#5000Fixes: scylladb/scylla-enterprise#5001Closesscylladb/scylladb#22030
When starting the view builder, we find all existing views in
`calculate_shard_build_step` and then register a listener for new views.
Between these steps we may yield and create a new view, then we miss
initializing the view build step for the new view, and we won't start
building it.
To fix this we first register the listener and then read existing views,
so a view can't be missed.
Fixesscylladb/scylladb#20338Closesscylladb/scylladb#22184
Usually, the smaller the messsage, the higher the CPU cost per each network byte
saved by compression, so it often makes sense to reserve heavier compression
for bigger messages (where it can make the biggest impact for a given CPU budget)
and use ligher compression for smaller messages.
There is a knob -- internode_compression_zstd_min_message_size -- which
excludes RPC messages below certain size from being compressed with zstd.
We arbitrarily set its default to 0 bytes before.
Now we want to arbitrarily set it to 1024 bytes.
This is based purely on intuition and isn't backed by any solid data.
Fixesscylladb/scylla-enterprise#4731Closesscylladb/scylla-enterprise#4990Closesscylladb/scylladb#22204
We still have a number of issues to be solved for views with tablets.
Until they are fixed, we should prevent users from creating them,
and use the vnode-based views instead.
This patch prepares the feature for enabling views with tablets. The
feature is disabled by default, but currently it has no effect.
After all tests are adjusted to use the feature, we should depend
on the feature for deciding whether we can create materialized views
in tablet-enabled keyspaces.
The unit tests are adjusted to enable this feature explicitly, and it's
also added to the scylla sstable tool config - this tool treats all
tables as if they were tablet-based (surprisingly, with SimpleStrategy),
so for it to work on views, the new feature must be enabled.
Refs scylladb/scylladb#21832Closesscylladb/scylladb#21833
mark the config parameter --commitlog-use-hard-size-limit as deprecated so the
default 'true' is always used, making the hard limit mandatory.
Fixesscylladb/scylladb#16471Closesscylladb/scylladb#21804
node_ops_virtual_task does not filter the entries of system.topology_request
and so it creates statuses of operations that aren't node ops.
Filter the entries used by node_ops_virtual_task. With this change, the status
of a bootstrap of the first node will not be visible.
Fixes: https://github.com/scylladb/scylladb/issues/22008.
Needs backport to 6.2 that introduced node_ops_virtual_task
Closesscylladb/scylladb#22009
* github.com:scylladb/scylladb:
test: truncate the table before node ops task checks
node_ops: rename a method that get node ops entries
node_ops: filter topology_requests entries
now that we are allowed to use C++23. we now have the luxury of using
`std::views::reverse`.
- replace `boost::adaptors::transformed` with `std::views::transform`
- remove unused `#include <boost/range/adaptor/reversed.hpp>`
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.
When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.
To avoid the ambiguity, some caller sites were updated to call
`seastar::format()` explicitly.
See also
- boost::algorithm::join():
https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
https://fmt.dev/11.0/api/#ranges-api
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22082
ICS is a compaction strategy that inherits size tiered properties --
therefore it's write optimized too -- but fixes its space overhead of
100% due to input files being only released on completion. That's
achieved with the concept of sstable run (similar in concept to LCS
levels) which breaks a large sstable into fixed-size chunks (1G by
default), known as run fragments. ICS picks similar-sized runs
for compaction, and fragments of those runs can be released
incrementally as they're compacted, reducing the space overhead
to about (number_of_input_runs * 1G). This allows user to increase
storage density of nodes (from 50% to ~80%), reducing the cost of
ownership.
NOTE: test_system_schema_version_is_stable adjusted to account for batchlog
using IncrementalCompactionStrategy
contains:
compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh)
compaction/CMakeLists.txt: include ICS cc files
configure.py: changes for ICS files, includes test
db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported
db/system_keyspace: pick ICS for some system tables
schema/schema.hh: ICS becomes default
test/boost: Add incremental_compaction_test.cc
test/boost/sstable_compaction_test.cc: ICS related changes
test/cqlpy/test_compaction_strategy_validation.py: ICS related changes
docs/architecture/compaction/compaction-strategies.rst: changes to ICS section
docs/cql/compaction.rst: changes to ICS section
docs/cql/ddl.rst: adds reference to ICS options
docs/getting-started/system-requirements.rst: updates sentence mentioning ICS
docs/kb/compaction.rst: changes to ICS section
docs/kb/garbage-collection-ics.rst: add file
docs/kb/index.rst: add reference to <garbage-collection-ics>
docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section
some relevant commits throughout the ICS history:
commit 434b97699b39c570d0d849d372bf64f418e5c692
Merge: 105586f747 30250749b8
Author: Paweł Dziepak <pdziepak@scylladb.com>
Date: Tue Mar 12 12:14:23 2019 +0000
Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael
"
Introduce new compaction strategy which is essentially like size tiered
but will work with the existing incremental compaction. Thus incremental
compaction strategy.
It works like size tiered, but each element composing a tier is a sstable
run, meaning that the compaction strategy will look for N similar-sized
sstable runs to compact, not just individual sstables.
Parameters:
* "sstable_size_in_mb": defines the maximum sstable (fragment) size
composing
a sstable run, which impacts directly the disk space requirement which is
improved with incremental compaction.
The lower the value the lower the space requirement for compaction because
fragments involved will be released more frequently.
* all others available in size tiered compaction strategy
HOWTO
=====
To change an existing table to use it, do:
ALTER TABLE mykeyspace.mytable WITH compaction =
{'class' : 'IncrementalCompactionStrategy'};
Set fragment size:
ALTER TABLE mykeyspace.mytable WITH compaction =
{'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 }
"
commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1
Merge: dca89ce7a5 e08ef3e1a3
Author: Avi Kivity <avi@scylladb.com>
Date: Tue Sep 8 11:31:52 2020 +0300
Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael
"
A new option, space_amplification_goal (SAG), is being added to ICS. This option
will allow ICS user to set a goal on the space amplification (SA). It's not
supposed to be an upper bound on the space amplification, but rather, a goal.
This new option will be disabled by default as it doesn't benefit write-only
(no overwrites) workloads and could hurt severely the write performance.
The strategy is free to delay triggering this new behavior, in order to
increase overall compaction efficiency.
The graph below shows how this feature works in practice for different values
of space_amplification_goal:
https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png
When strategy finds space amplification crossed space_amplification_goal, it
will work on reducing the SA by doing a cross-tier compaction on the two
largest tiers. This feature works only on the two largest tiers, because taking
into account others, could hurt the compaction efficiency which is based on
the fact that the more similar-sized sstables are compacted together the higher
the compaction efficiency will be.
With SAG enabled, min_threshold only plays an important role on the smallest
tiers, given that the second-largest tier could be compacted into the largest
tier for a space_amplification_goal value < 2.
By making the options space_amplification_goal and min_threshold independent,
user will be able to tune write amplification and space amplification, based on
the needs. The lower the space_amplification_goal the higher the write
amplification, but by increasing the min threshold, the write amplification
can be decreased to a desired amount.
"
commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1
Author: Raphael S. Carvalho <raphaelsc@scylladb.com>
Date: Sat Oct 16 13:41:46 2021 -0300
compaction: ICS: Add garbage collection
Today, ICS lacks an approach to persist expired tombstones in a timely manner,
which is a problem because accumulation of tombstones are known to affecting
latency considerably.
For an expired tombstone to be purged, it has to reach the top of the LSM tree
and hope that older overlapping data wasn't introduced at the bottom.
The condition are there and must be satisfied to avoid data resurrection.
STCS, today, has an inefficient garbage collection approach because it only
picks a single sstable, which satisfies the tombstone density threshold and
file staleness. That's a problem because overlapping data either on same tier
or smaller tiers will prevent tombstones from being purged. Also, nothing is
done to push the tombstones to the top of the tree, for the conditions to be
eventually satisfied.
Due to incremental compaction, ICS can more easily have an effecient GC by
doing cross-tier compaction of relevant tiers.
The trigger will be file staleness and tombstone density, which threshold
values can be configured by tombstone_compaction_interval and
tombstone_threshold, respectively.
If ICS finds a tier which meets both conditions, then that tier and the
larger[1] *and* closest-in-size[2] tier will be compacted together.
[1]: A larger tier is picked because we want tombstones to eventually reach the
top of the tree.
[2]: It also has to be the closest-in-size tier as the smaller the size
difference the higher the efficiency of the compaction. We want to minimize
write amplification as much as possible.
The staleness condition is there to prevent the same file from being picked
over and over again in a short interval.
With this approach, ICS will be continuously working to purge garbage while
not hurting overall efficiency on a steady state, as same-tier compactions are
prioritized.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22063