Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
currently repair requests can't be added or deleted on non-base
colocated tables. improve the error message and comments to be more
clear and detailed.
The compaction_manager::stop_compaction() method internally walks the
list of tasks and compares each task's compacting_table (which is
compaction group view pointer) with the given one. In case this
stop_compaction() method is called via API for a specific table, the
method walks the list of tasks for every compaction group from the
table, thus resulting in nr_groups * nr_tasks complexity. Not terrible,
but not nice either.
The proposal is to pass filtering function into the inner
do_stop_ongoing_compactions() method. Some users will pass a simple
"return true" lambda, but those that need to stop compactions for a
specitif table (e.g. -- the API handler) will effectively walk the
list of tasks once comparing the given compaction group's schema with
the target table one (spoiler: eventually this place will also be
simplified not to mess with replica::table at all).
One ugliness with the change is the way "scope" for logging message is
collected. If all tasks belong to the same table, then "for table ..."
is printed in logs. With the change the scope is no longer known
instantly and is evaluated dynamically while walking the list of tasks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25846
task_status contains a vector of children identities. If the number
of children is large, we may hit oversized allocation.
Change all types of children-related containers to chunked_vector.
Modify the children type returned from task manager API.
Fixes: scylladb#25795.
Closesscylladb/scylladb#25923
The handler uses database service, not storage_service, and should
belong to the corresponding API module from column_family.cc
Once moved, the handler can use captured sharded<database> reference and
forget about http_context::db.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25834
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
Fixes#25605Closesscylladb/scylladb#25693
Parsiong scrub options may throw after a snapshot is taken thus leaving
it on disk even though an operation reported as "failed". Not, probably,
critical, but not nice either.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handler validates if the given ks:cf pair exists in the database,
then finds the table id to process further. There's a helper that does
both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25669
POSTing on the same URL launches storage_service::drain(), so GETing on
it should (not that it's restriced somehow, but still) work on the same
service. This changes removes one more user of http_context::database
which in turn will allow removding database reference from context
eventually.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25677
These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need in dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other.
Ongoing http_context dependency cleanup, no need to backport
Closesscylladb/scylladb#25674
* github.com:scylladb/scylladb:
api: Capture and use db in cache_service handlers
api: Add sharded<database>& arg to set_cache_service()
api: Squash (un)set_cache_service into ..._column_family
api: Coroutinize set_server_column_family()
Both handlers need database to proceed and thus need to be registered
(and unregistered) in a group that captures database for its handlers.
Once moved, the used get_cf_stats() method can be marked local to
column_family.cc file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25671
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros
Closes#25182
The reference is already available in set_server_column_family(), pass
it further so that "cache" handlers are able to use it (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_server_column_family() registers API handlers that work with
replica::database. The set_server_cache() does the very same thing, but
registers handlers with some other prefix. Squash the latter into
former, later "cache" handlers will also make use of the database
reference argument that's already available in ..._column_family()
setter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Update more handlers not to get databse from context, but to capture it
directly on handlers' lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it accepts http context and immediately gets the database from it to
pass to map_reduce_cf. Callers are updated to pass database from where
the context they already have.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch finalizes the change started by the previous patch of the
similar title -- the map_reduce_cf(_raw) is switched to work only with
sharded<replica::database> reference. All callers were updated by
previous patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some of them left that still pass http_context. These handlers
will eventually get their captured sharded database reference, but for
now make them explicitly use one from context. This will allow to
de-templatize map_reduce_cf... helpers making the code simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not all of them can switch from ctx to database, so in few places both,
the database and ctx, are captured. However, the ctx.db reference is no
longer used by the column_family handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to other API handlers, instead of using a database from http
context, patch the setting methods to capture the database from main
code and pass it around to handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches are going to change a bunch of map_reduce_cf_... callers to
pass sharded<database> reference to it, not the http context. Not to
patch all the api/ code at once, keep the ability to call it with ctx at
hand. Eventually only the sharded<database>& overload will be kept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closesscylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
Fixesscylladb/scylladb#19061
Backport is not required, it is new functionality
Closesscylladb/scylladb#25063
* github.com:scylladb/scylladb:
docs: Add documentation for the nodetool dropquarantinedsstables command
nodetool: add command for dropping quarantine sstables
rest_api: add endpoint which drops all quarantined sstables
Instead of using lambda, pass pointer to struct member. The result is
the same, but the code is nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25123
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
Fixesscylladb/scylladb#19061
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.
Fixes#24910Closesscylladb/scylladb#24911