In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).
Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.
Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).
Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.
Fixes#26914.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 1e37781d86)
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.
In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.
Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).
As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 76b2d0f961)
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
(cherry picked from commit 86dde50c0d)
To avoid case when an old coordinator (which hasn't been stopped yet)
dictates what should be done, add raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
(cherry picked from commit fb8cbf1615)
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).
Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using with a listener callback.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 2fc812a1b9)
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 96e727d7b9)
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.
After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.
In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.
After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.
High-level implementation strategy:
1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
and add a new one.
5. Update the user documentation.
Fixes scylladb/scylladb#23030
Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.
- (cherry picked from commit a1254fb6f3)
- (cherry picked from commit d6fcd18540)
- (cherry picked from commit 994f09530f)
- (cherry picked from commit 6322b5996d)
- (cherry picked from commit 71606ffdda)
- (cherry picked from commit 00222070cd)
- (cherry picked from commit 288be6c82d)
- (cherry picked from commit b409e85c20)
Parent PR: #25802Closesscylladb/scylladb#26416
* github.com:scylladb/scylladb:
view: Stop requiring experimental feature
db/view: Verify valid configuration for tablet-based views
db/view: Require rf_rack_valid_keyspaces when creating view
test/cluster/random_failures: Skip creating secondary indexes
test/cluster/mv: Mark test_mv_rf_change as skipped
test/cluster: Adjust MV tests to RF-rack-validity
test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
db/view: Name requirement for views with tablets
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.
(cherry picked from commit 575dce765e)
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:
* experimental feature `views-with-tablets`,
* configuration option `rf_rack_vaid_keyspaces`.
Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.
We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
(cherry picked from commit 288be6c82d)
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes#25195.
Closesscylladb/scylladb#26003
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
`compression_parameters` provides two levels of validation:
* syntactic checks - implemented in the constructor
* semantic checks - implemented by `compression_parameters::validate()`
The former are applied implicitly when parsing the options from the
command line or from scylla.yaml. The latter are currently not applied,
but they should.
In lack of a better place, apply them in main, right after joining the
cluster, to make sure that the cluster features have been negotiated.
The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation
will fail if the feature is disabled and a dictionary compression
algorithm has been selected.
Also, mark `validate()` as const so that it can be called from a config
object.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong.
Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.
No need to backport, view build coordinator is not a part of any release yet.
Closesscylladb/scylladb#26122
* github.com:scylladb/scylladb:
mv: fix typo in start_backgroud_fibers
mv: run view building worker fibers in streaming group
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is
better to keep it in a separate module.
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
The background fibers of the view building worker are indirectly spawned
by the main function, thus the fibers inherit the "main" scheduling
group. The main scheduling group is not supposed to be used for regular
work, only for initialization and deinitialization, so this is wrong.
Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group`
and use the streaming scheduling group. The view building worker already
handles RPCs in the streaming scheduling group (which do most of the
work; background fibers only do some maintenance), so this seems like a
good fit.
Similarly to view builder, view building worker needs to be drained
in `storage_service::do_drain()`.
Storage service drain is happening at the same beginning of shutdown
procedure. Before this patch, the worker was still building views
after the storage service was drained and this caused errors like:
`Error applying view update to (named_gate_closed_exception)` and
`locator::no_such_tablet_map`.
Fixesscylladb/scylladb#25908Closesscylladb/scylladb#25984
The --help text says about --large-memory-allocation-warning-threshold:
"Warn about memory allocations above this size; set to zero to disable."
That's half-true: setting the value to zero spams logs with warnings of
allocation of any size, as seastar treats zero threshold literaly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25850
These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need in dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other.
Ongoing http_context dependency cleanup, no need to backport
Closesscylladb/scylladb#25674
* github.com:scylladb/scylladb:
api: Capture and use db in cache_service handlers
api: Add sharded<database>& arg to set_cache_service()
api: Squash (un)set_cache_service into ..._column_family
api: Coroutinize set_server_column_family()
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.
The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942Closesscylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
Repair service currently have two functions: stop() and shutdown() that
stop all ongoing repairs and prevent any further repairs from being started.
It is possible to stop the repair_service once. Once stopped, it cannot
be restarted. We would like, however, to enable / disable the repair service
many times.
Similarly to compaction_manager, the repair service provides two new functions:
- drain() - abort all ongoing local repair task and disable the service,
i.e. no new local task will be scheduled and data received from
the repair master is rejected. It's, though, still possible to
schedule a global repair request
- enable() - enable the service
By default, the repair service is enabled immediately once started.
For tablet-based keyspaces, the new facility prevents tablets from being
repaired. Whenever the repair_service is disabled and the request to repair
a tablet arrives, an exception is returned.
Once the exception is thrown, the tablet is moved into the end_repair
state and the operation will be retried later. Hence, disabling the service
does not fail the global tablet repair request.
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.
The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.
The API supports flexible trigger options, including notifications on threshold
crossing and direction (above/below). This enables more granular and efficient
disk space monitoring for consumers.
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.
Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
to `view_update_generator` but create view building tasks for
later
The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
The method will be used when a new staging sstable needs to go through
the view building coordinator (the coordinator will decide when to
process this staging sstable).
Callers push new staging sstables to a queue and notifiy the async fiber
to create `view_building_task`s from the sstables and commit them to
group0.
The worker is responsible for building tablet-based views by
executing tasks scheduled by the view building coordinator.
It observes view building state machine and wait on the machine's
conditional variable (so the worker is woken up when group0 state is
applied).
The tasks are executed in batches, all tasks in one batch need to have
the same: type, base_id, table_id. One shard can only execute one batch
at a time (at least for now, in the future we might want to change
that).
That worker keeps track of finished and failed tasks in its local state.
The state is cleared when `view_building_state::currently_processed_base_table`
is changed.
The state may be also reloaded on `topology_change` or `mixed_change`
because topology coordinator may change view building tasks during
tablet operations.
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.
Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.
In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.
That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.
These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.
We provide a reproducer test that consistently fails before these
changes but passes after them.
Fixesscylladb/scylladb#24792
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.
The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.
Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.
We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.
The set_server_column_family() registers API handlers that work with
replica::database. The set_server_cache() does the very same thing, but
registers handlers with some other prefix. Squash the latter into
former, later "cache" handlers will also make use of the database
reference argument that's already available in ..._column_family()
setter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.
We provide a validation test.
- **remove redis documentation**
First, remove the redis documentation.
- **remove ./redis and dependencies**
Second, remove the redis directory and its dependencies from the project.
Fixes: #25144
This is a cleanup, no need to backport.
Closesscylladb/scylladb#25148
* github.com:scylladb/scylladb:
remove ./redis and dependencies
remove redis documentation
Similarly to other API handlers, instead of using a database from http
context, patch the setting methods to capture the database from main
code and pass it around to handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).
However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.
This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.
The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.
Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.
Refs #25115Fixes#24625
Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.
Closesscylladb/scylladb#25151
* https://github.com/scylladb/scylladb:
raft_group0: split shutdown into abort_and_drain and destroy
Revert "main.cc: fix group0 shutdown order"
Nowadays the way to configure an internal service is
1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value
The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
For the reference: similar changes with other services: #23705 , #20174 , #19166Closesscylladb/scylladb#25118
* github.com:scylladb/scylladb:
gms,init: Move get_disabled_features_from_db_config() from gms
code: Update callers generating feature service config
gms: Make feature_config a simple struct
gms: Split feature_config_from_db_config() into two
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).
However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.
This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.
The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.
Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().
Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().
The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.
Make it mandatory and mark the option unused.
Closesscylladb/scylladb#24894
Instead of requesting it from gms code, create it "by hand" with the
help of get_disabled_features_from_db_config() method. This is how other
services are configured by main/tools/testing code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closesscylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start