Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from the this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944Closes#14955
* github.com:scylladb/scylladb:
feature_service: don't load whole topology state to check features
system_keyspace: separate loading topology_features from topology
topology_state_machine: extract features-related fields to a struct
untyped_result_set: add missing_column_exception
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes#14907
Also, use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Closes#14952
* github.com:scylladb/scylladb:
raft: group0_state_machine: transfer_snapshot: make abortable
raft: group0_state_machine: transfer_snapshot: hold gate
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from the this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
`enabled_features` and `supported_features` are now moved to a new
`topology::features` struct. This will allow to move load this
information independently from the `topology` struct, which will be
needed for feature checking during start.
Getting reason argument in task_manager_module::get_progress is deceiving
as the method works properly only for streaming::stream_reason::repair
(repair::shard_repair_task_impl::nr_ranges_finished isn't updated for
any other reason).
If some state update in database::add_column_family throws,
info about a column family would be inconsistent.
Undo already performed operations in database::add_column_family
when one throws.
Fixes: #14666.
Closes#14672
* github.com:scylladb/scylladb:
replica: undo the changes if something fails
replica: start table earlier in database::add_column_family
All compaction task executors, except for regular compaction one,
become task manager compaction tasks.
Creating and starting of major_compaction_task_executor is modified
to be consistent with other compaction task executors.
Closes#14505
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to cover compaction group tasks
compaction: turn custom_task_executor into compaction_task_impl
compaction: turn sstables_task_executor into sstables_compaction_task_impl
compaction: change sstables compaction tasks type
compaction: move table_upgrade_sstables_compaction_task_impl
compaction: pass task_info through sstables compaction
compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
comapction: use optional task info in major compaction
compaction: use perform_compaction in compaction_manager::perform_major_compaction
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.
Fixes: #14924Closes#14928
we wait for the same condition couple lines before, so no need to
check it again using `BOOST_CHECK_EQUAL()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14921
Currently, when one tries to access a column that an untyped_result_set
does not contain, a `std::bad_variant_access` exception is thrown. This
exception's message provides very little context and it can be difficult
to even figure out where this message is coming from.
In order to improve the situation, a new exception `missing_column` is
introduced which includes the missing column's name in its error
message. The exception derives from `std::bad_variant_access` for
compatibility with existing code that may want to catch it.
`boost::program_options::value()` create a new typed_value<T> object,
without holding it with a shared_ptr. boost::program_options expects
developer to construct a `bpo::option_description` right away from it.
and `boost::program_options::option_description` takes the ownership
of the `type_value<T>*` raw pointer, and manages its life cycle with
a shared_ptr. but before passing it to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is a still
a raw pointer.
before this change, we initialize positional options as global
variables using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from it --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked. hence LeakSanitizer warns us.
after this change, we create the option vector as a static
local variable in a function so it is created on demand as well.
as an alternative, we could initialize the options vector as local
variable where it used. but to be more consistent with how
`global_option` is specified. and to colocate them in a single
place, let's keep the existing code layout.
Fixes#14929
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14939
Commit bc5f6cf45d
added a reserve call to the `ranges` vector before
inserting all the returned token ranges into it.
However, that reservation is too small as we need
to express size+1 ranges for size tokens with
<unbound, token[0]> and <token[size-1], unbound>
ranges at the front and back, respectively.
Fixes#14849
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14938
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes#14907
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds the ranges_parallelism option to repair restful API.
Users can use this option to optionally specify the number of ranges to repair in parallel per repair job to a smaller number than the Scylla core calculated default max_repair_ranges_in_parallel.
Scylla manager can also use this option to provide more ranges (>N) in a single repair job but only repairing N ranges_parallelism in parallel, instead of providing N ranges in a repair job.
To make it safer, unlike the PR #4848, this patch does not allow user to exceed the max_repair_ranges_in_parallel.
Fixes#4847Closes#14886
* github.com:scylladb/scylladb:
repair: Add ranges_parallelism option
repair: Change to use coroutine in do_repair_ranges
before this change, if the object_store test fails, the tempdir
will be preserved. and if our CI test pipeline is used to perform
the test, the test job would scan for the artifacts, and if the
test in question fails, it would take over 1 hour to scan the tempdir.
to alleviate the pain, let's just keep the scylla logging file
no matter the test fails or succeeds. so that jenkins can scan the
artifacts faster if the test fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14880
Per-table metrics are very valuable for the users, it does come with a high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-metrics table that will be reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by write operations in memtables
system_column_family_total_disk_space - Total disk space used system_column_family_live_sstable - Live sstable count system_column_family_read_latency_count - Number of reads system_column_family_write_latency_count - Number of writes
The names of the read/write metrics is based on the histogram convention, so when latencies histograms will be added, the names will not change.
The metrics are label with a specific label __per_table="node" so it will be possible to easily manipulate it.
The metrics will be available when enable_metrics_reporting (the per-table full metrics flag) is off
Fixes#2198Closes#13293
* github.com:scylladb/scylladb:
replica/table.cc: Add node-per-table metrics
config: add enable_node_table_metrics flag
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.
We make sure that all notifications (expect for `before_change`, that
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.
An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.
Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471
Closes#14845
* github.com:scylladb/scylladb:
gossiper: replicate: ensure non-null permit
gossiper: add_saved_endpoint: lock_endpoint
gossiper: mark_as_shutdown: lock_endpoint
gossiper: real_mark_alive: lock_endpoint
gossiper: advertise_token_removed: lock_endpoint
gossiper: do_status_check: lock_endpoint
gossiper: remove_endpoint: lock_endpoint if needed
gossiper: force_remove_endpoint: lock_endpoint if needed
storage_service: lock_endpoint when removing node
gossiper: use permit_id to serialize state changes while preventing deadlocks
gossiper: lock_endpoint: add debug messages
utils: UUID: make default tagged_uuid ctor constexpr
gossiper: lock_endpoint must be called on shard 0
gossiper: replicate: simplify interface
gossiper: mark_as_shutdown: make private
gossiper: convict: make private
gossiper: mark_as_shutdown: do not call convict
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.
Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:
1. First, a global barrier is performed to make sure that all nodes saw
and persisted the same state of the `system.topology` table as the
coordinator and see the same supported features of all nodes. When
booting, nodes are now forbidden to revoke support for a feature if all
nodes declare support for it, a successful barrier this makes sure that
no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
`system.topology` table.
The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
For now, the implementation relies on gossip shadow round check to
protect from nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify joining node's cluster features
before the new node is added to group 0 and to the cluster.
A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.
Closes#14722
* github.com:scylladb/scylladb:
test: topology_experimental_raft: cluster feature tests
test: topology: fix a skipped test
storage_service: add injection to prevent enabling features
storage_service: initialize enabled features from first node
topology_state_machine: add size(), is_empty()
group0_state_machine: enable features when applying cmds/snapshots
persistent_feature_enabler: attach to gossip only if not using raft
feature_service: enable and check raft cluster features on startup
storage_service: provide raft_topology_change_enabled flag from outside
storage_service: enable features in topology coordinator
storage_service: add barrier_after_feature_update
topology_coordinator: exec_global_command: make it optional to retake the guard
topology_state_machine: add calculate_not_yet_enabled_features
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes#12010Closes#14892
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
for faster build times and clear inter-module dependencies, we
should not #includes headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed, except for "test/lib/scylla_test_case.hh"
as it brings some command line options used by scylla tests.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14922
get_compacted_fragments_writer() returns a instance of
`compacted_fragments_writer`, there is no need to cast it again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14919
Per-table metrics are very valuable for the users, it does come with a
high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-metrics table that will be
reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics is based on the histogram convention,
so when latencies histograms will be added, the names will not change.
The metrics are label with a specific label __per_table="node" so it
will be possible to easily manipulate it.
The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.
There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.
This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.
For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
add fmt formatter for `compaction_task_executor::state` and
`compaction_task_executor` and its derived classes.
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `compaction_task_executor`, its derived classes and
`compaction_task_executor::state` without the help of `operator<<`.
since all of the callers of 'operator<<' of these types now use
formatters, the operator<< are removed in this change. the helpers
like `to_string()` and `describe()` are removed as well, as it'd
be more consistent if we always use fmtlib for formatting instead
of inventing APIs with different names.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14906
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.
This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.
Fixesscylladb/scylladb#14896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14897
Although the implementation of cluster features on raft is not complete
yet, it makes sense to add some tests for the existing implementation.
The `test_raft_cluster_features.py` file includes the same set of tests
as the file with non-raft-based cluster feature tests, plus one
additional test which checks that a node will not allow disabling a
feature if it sees that other nodes support it (even though the feature
is not enabled yet).
The `test_partial_upgrade_can_be_finished_with_removenode` test does not
work because the `cql` variable is used before it is declared. It was
not noticed because the test is marked as skipped, and does not work for
the non-raft cluster feature implementation. The variable declaration is
moved higher and the test now works; it will be used to test the raft
cluster feature implementation.
The first node in the cluster defines it and it does not need to consult
with anybody whether its features should be enabled or not. We can
immediately mark those features as enabled in raft when the first node
inserts its join request to the topology table.
The enable_features_on_join function is now only called if the node does
not use topology over raft, and so the node will not react to changes in
gossip features.
In the future, support for switching to topology coordinator in runtime
will be added and the persistent feature enabler should disconnect
itself during the upgrade procedure. We don't have such procedure yet,
so a bunch of TODOs is added instead.
The enable_features_on_startup method is adjusted for the raft-based
cluster features. In topology coordinator mode:
- Information about enabled features is taken from system.topology
instead of the usual system.scylla_local (`enabled_features` key).
- Features which, according to the local state, are supported by all
nodes but not enabled yet are also checked. Support for such features
cannot be revoked safely because the topology coordinator might have
performed a successful global barrier and might have proceeded with
marking the feature as enabled.
Information about whether we are using topology changes on raft or not
will be soon necessary for the persistent feature enabler, so that it
can do some additional checks based on the local raft topology state.
If the topology coordinator notices that there are no nodes requesting
to be joined, no topology operations in progress and there are some
features that are declared to be supported by all normal nodes but not
enabled yet, the topology coordinator will attempt to enable those
features. This is done in the following way, under a group 0 guard:
- A global `barrier_after_feature_update` is performed to make sure
that:
- All nodes have already updated their supported_features column after
boot and won't attempt to revoke any during current runtime,
- Saw and persisted the latest topology state so that, after restart,
the feature check won't allow them to revoke support for features
that the topology coordinator is going to enable.
- After the barrier succeeds, the coordinator tries to add the features
to the `enabled_features` column.
Ensure that replicate is called under lock_endpoint
to serialize endpoint state changes on all shards.
Otherwise, we may end up with incosistent state
across shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state
and calls replicate and mark_dead, therefore it
must ensure this is done under lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>