... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18898
since we've migrated away from the generic homebrew formatters
for range-alike containers, there is no need to keep there operator<<
around -- they were preserved in order to work with the container
formatters which expect operator<< of the elements.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard coded
estimation ratio is used.
This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.
It is set to 0.1 by default.
Fixes#18615Closesscylladb/scylladb#18634
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
Closesscylladb/scylladb#18443
We move consistent cluster management out of experimental and
make it the default for new clusters in 6.0. In code, we make the
`consistent-topology-changes` flag unused and assumed to be true.
In 6.0, the topology upgrade procedure will be manual and
voluntary, so some clusters will still be using the gossip-based
topology even though they support the raft-based topology.
Therefore, we need to continue testing the gossip-based topology.
This is possible by using the `force-gossip-topology-changes` flag
introduced in scylladb/scylladb#18284.
Ref scylladb/scylladb#17802Closesscylladb/scylladb#18285
* github.com:scylladb/scylladb:
docs: raft.rst: update after removing consistent-topology-changes
treewide: fix indentation after the previous patch
db: config: make consistent-topology-changes unused
test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
test: test_read_required_hosts: run with force-gossip-topology-changes
storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
storage_service: join_token_ring: fix finish_setup_after_join calls
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
Currently, if tombstone_gc mode isn't specified for a table,
then "timeout" is used by default. With tablets, running
"nodetool repair -pr" may miss a tablet if it migrated across
the nodes. Then, if we expire tombstones for ranges that
weren't repaired, we may get data resurrection.
Set default tombstone_gc mode value for DDLs that don't
specify it. It's set to "repair" for tables which use tablets
unless they use local replication strategy or rf = 1.
Otherwise it's set to "timeout".
We are going to make the `consistent-topology-changes` experimental
feature unused in 6.0. However, the topology upgrade procedure will
be manual and voluntary, so some 6.0 clusters will be using the
gossip-based topology. Therefore, we need to continue testing the
gossip-based topology. The solution is introducing a new flag,
`force-gossip-topology-changes`, that will enforce the gossip-based
topology in a fresh cluster.
In this patch, we only introduce the parameter without any effect.
Here is the explanation. Making `consistent-topology-changes` unused
and introducing `force-gossip-topology-changes` requires adjustments
in scylla-dtest. We want to merge changes to scylladb and scylla-dtest
in a way that ensures all tests are run correctly during the whole
process. If we merged all changes to scylladb first, before merging
the scylla-dtest changes, all tests would run with the raft-based
topology and the ones excluded in the raft-based topology would fail.
We also can't merge all changes to scylla-dtest first. However, we
can follow this plan:
1. scylladb: merge this patch
2. scylla-dtest: start using `force-gossip-topology-changes`
in jobs that run without the raft-based topology
3. scylladb: merge the rest of the changes
4. scylla-dtest: merge the rest of the changes
Ref scylladb/scylladb#17802Closesscylladb/scylladb#18284
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Injection parameters can be used in the lambda passed to
inject_with_handler method to take some values from
the test. However, there was no way to set values to these
parameters on node startup, only through
the error injection REST api. Therefore, we couldn't rely
on this when inject_with_handler is used during
node startup, it could trigger before we call the api
from the test.
In this commit with solve this problem by allowing these
parameters to be assigned through scylla.yaml config.
The defer.hh header was added to error_injection.hh to fix
compilation after adding error_injection.hh to config.hh,
defer function is used in error_injection.hh.
Regulates the page size in bytes via config, instead of the currently
used hard-coded constant. Allows tests to configure lower limits so they
can work with smaller data-sets when testing paging related
functionality.
Not wired yet.
When topology barrier is blocked for longer than configured threshold
(2s), stale versions are marked as stalled and when they get released
they report backtrace to the logs. This should help to identify what
was holding for token metadata pointer for too long.
Example log:
token_metadata - topology version 30 held for 299.159 [s] past expiry, released at: 0x2397ae1 0x23a36b6 ...
Closesscylladb/scylladb#17427
Set filesystem permissions for the maintenance socket to 660 (previously it was 755) to allow a scyllaadm's group to connect.
Split the logic of creating sockets into two separate functions, one for each case: when it is a regular cql controller or used by maintenance_socket.
Fixes https://github.com/scylladb/scylladb/issues/16487.
Closesscylladb/scylladb#17113
* github.com:scylladb/scylladb:
maintenance_socket: add option to set owning group
transport/controller: get rid of magic number for socket path's maximal length
transport/controller: set unix_domain_socket_permissions for maintenance_socket
transport/controller: pass unix_domain_socket_permissions to generic_server::listen
transport/controller: split configuring sockets into separate functions
Option `maintenance-socket-group` sets the owning group of the maintenance socket.
If not set, the group will be the same as the user running the scylla node.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `error_injection_at_startup`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17211
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
This makes scylla-manager mis-interpret the data_file_directories
somehow, issue #17078
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer.
If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint.
Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on.
A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split.
Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).
A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready.
When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split").
The split monitor will start splitting of compaction groups (using mechanism introduced here: 081f30d149) for the table. And once splitting work is completed, the table updates its local state as having completed split.
When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step).
Fixes#16536.
Closesscylladb/scylladb#16580
* github.com:scylladb/scylladb:
test/topology_experimental_raft: Add tablet split test
replica: Bypass reshape on boot with tablets temporarily
replica: Fix table::compaction_group_for_sstable() for tablet streaming
test/topology_experimental_raft: Disable load balancer in test fencing
replica: Remap compaction groups when tablet split is finalized
service: Split tablet map when split request is finalized
replica: Update table split status if completed split compaction work
storage_service: Implement split monitor
topology_cordinator: Generate updates for resize decisions made by balancer
load_balancer: Introduce metrics for resize decisions
db: Make target tablet size a live-updateable config option
load_balancer: Implement resize decisions
service: Wire table_resize_plan into migration_plan
service: Introduce table_resize_plan
tablet_mutation_builder: Add set_resize_decision()
topology_coordinator: Wire load stats into load balancer
storage_service: Allow tablet split and migration to happen concurrently
topology_coordinator: Periodically retrieve table_load_stats
locator: Introduce topology::get_datacenter_nodes()
storage_service: Implement table_load_stats RPC
replica: Expose table_load_stats in table
replica: Introduce storage_group::live_disk_space_used()
locator: Introduce table_load_stats
tablets: Add resize decision metadata to tablet metadata
locator: Introduce resize_decision
This change is intended to ensure, that
db::config fields related to directories
are not changed. To achieve that a member
function called setup_directories() is
removed.
The responsibility for directories paths
has been moved to utils::directories,
which may generate default paths if the
configuration does not provide a specific
value.
Fixes: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Native histograms (also known as sparse histograms) are an experimental Prometheus feature.
They use protobuf as the reporting layer.
Native histograms hold the benefits of high resolution at a lower resource cost.
This series allows sending histograms in a native histogram format over protobuf.
By default, protobuf support is disabled. To use protobuf with native histograms, the command line flag prometheus_allow_protobuf should be set to true, and the Prometheus server should send the accept header with protobuf.
Fixes#12931Closesscylladb/scylladb#16737
* github.com:scylladb/scylladb:
main.cc: Add prometheus_allow_protobuf command line
histogram_metrics_helper: support native histogram
config: Add prometheus_allow_protobuf flag
Native histograms (also known as sparse histograms) are an experimental
Prometheus feature. They use protobuf as the reporting layer. The
prometheus_allow_protobuf flag allows the user to enable protobuf
protocol. When this flag is set to true, and the Prometheus server sends
in the request that it accepts protobuf, the result will be in protobuf
protocol.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Previous patch taught tablets allocator to multiply the initial tablets
count by some value. This patch makes this factor configurable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So far the service levels interval, responsible for updating SL configuration,
was hardcoded in main.
Now it's extracted to `service_levels_interval_ms` option.
One of the unfortunate anti-features of cql_test_env (the framework used
in our CQL tests that are written in C++) is that it needs to repeat
various bizarre initializations steps done in main.cc, otherwise various
requests work incorrectly. One of these steps that main.cc is to initialize
various "schema extensions" which some of the Scylla features need to work
correctly.
We remembered to initialize some schema extensions in cql_test_env, but
forgot others. The one I will need in the following patch is the "tags"
extension, which we need to mark materialized views used by local
secondary indexes as "synchronous_updates" - without this patch the LSI
tests in secondary_index_test.cc will crash.
In addition to adding the missing extension, this patch also replaces
the segmentation-fault crash when it's missing (caused by a dynamic
cast failure) by a clearer on_internal_error() - so if we ever have
this bug again, it will be easier to debug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If set to "ignore", maintenance socket will be disabled.
If set to "workdir", maintenance socket will be opened on <scylla's
workdir>/cql.m.
Otherwise it will be opened on path provided by maintenance_socket
variable.
It is set by default to 'ignore'.
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixesscylladb/scylladb#15777
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now its plain updateable_value, but without the ..._source object the
updateable_value is just a no-op value holder. In order for the
observers to operate there must be the value source, updating it would
update the attached updateable values _and_ notify the observers.
In order for the config to be the u.v._source, config entries should be
comparable to each other, thus the <=> operator for it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replacing `restrict_replication_simplestrategy` config option with
2 config options: `replication_strategy_{warn,fail}_list`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on replication strategy when creating/altering a keyspace.
The reason to rather replace than extend `restrict_replication_simplestrategy` config
option is that it was not used and we wanted to generalize it.
Only soft guardrail is enabled by default and it is set to SimpleStrategy,
which means that we'll generate a CQL warning whenever replication strategy
is set to SimpleStrategy. For new cloud deployments we'll move
SimpleStrategy from warn to the fail list.
Guardrails violations will be tracked by metrics.
Resolves#5224
Refs #8892 (the replication strategy part, not the RF part)
Closesscylladb/scylladb#15399
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.
Some tests want specific host_id, we add it to cql_test_config
and use in cql_test_env.
We can't pass host_id to sstables_manager by value since it's
initialized in database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
"experimental" option was marked "Unused" in 64bc8d2f7d. but we
chose to keep it in hope that the upgrade test does not fail.
despite that the upgrade tests per-se survived the "upgrade",
after the upgrade, the tests exercising the experimental features
are still failing hard. they have not been updated to set the
"experimental-features" option, and are still relying on
"experimental" to enable all the experimental features under
test.
so, in this change, let's just drop the option so that
scylla can fail early at seeing this "experimental" option.
this should help us to identify the tests relying on it
quicker. as the "experimental" features should only be used
in development environment, this change should have no impact
to production.
Refs #15214
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15233
Replacing `minimum_keyspace_rf` config option with 4 config options:
`{minimum,maximum}_replication_factor_{warn,fail}_threshold`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on RF when creating/altering a keyspace.
The reason to rather replace than extend `minimum_keyspace_rf` config
option is to be aligned with Cassandra, which did the same, and has the
same parameters' names.
Only min soft limit is enabled by default and it is set to 3, which means
that we'll generate a CQL warning whenever RF is set to either 1 or 2.
RF's value of 0 is always allowed and means that there will not be any
replicas on a given DC. This was agreed with PM.
Because we don't allow to change guardrails' values when scylla is
running (per PM), there're no tests provided with this PR, and dtests will be
provided separately.
Exceeding guardrails' thresholds will be tracked by metrics.
Resolves#8619
Refs #8892 (the RF part, not the replication-strategy part)
Closes#14262
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.
This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggerred. We will change the default in a future
patch.
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.
Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right. We swap '\"' and '\'' with ' '
but don't do anything with it afterward.
To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.
This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes#14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
in this series, the "experimental" option is marked `Unused` as it has been marked deprecated for almost 2 years since scylla 4.6. and use `experimental_features` to specify the used experimental features explicitly.
Closes#14948
* github.com:scylladb/scylladb:
config: remove unused namespace alias
config: use std::ranges when appropriate
config: drop "experimental" option
test: disable 'enable_user_defined_functions' if experimental_features does not include udf
test: pylib: specify experimental_features explicitly
in 34c3688017, we added a virtual function
to `config_file`, and we new and delete pointer pointing to a
`db::config` instance with `unique_ptr<>`. this makes the compiler
nervous, as deleting a pointer pointing to an instance of non-final
class with virtual function could lead to leak, if this pointer actually
points to a derived class of this non-final class. so, in order to
silence the warning and to prevent potential problem in future, let's
mark `db::config` final.
the warning from Clang 16 looks like:
```
In file included from /home/kefu/dev/scylladb/test/lib/test_services.cc:10:
In file included from /home/kefu/dev/scylladb/test/lib/test_services.hh:25:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'db::config' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<db::config>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/lib/test_services.cc:189:16: note: in instantiation of member function 'std::unique_ptr<db::config>::~unique_ptr' requested here
auto cfg = std::make_unique<db::config>();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15071
The option was introduced to bootstrap the project. It's still
useful for testing, but that translates into maintaining an
additional option and code that will not be really used
outside of testing. A possible option is to later map the
option in boost tests to initial_tablets, which may yield
the same effect for testing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If `live_updatable_config_params_changeable_via_cql` is set to true, configuration parameters defined with `liveness::LiveUpdate` option can be updated in the runtime with CQL, i.e. by updating `system.config` virtual table.
If we don't want any configuration parameter to be changed in the
runtime by updating `system.config` virtual table, this option should be
set to false. This option should be set to false for e.g. cloud users,
who can only perform CQL queries, and should not be able to change
scylla's configuration on the fly.
Current implemenatation is generic, but has a small drawback - messages
returned to the user can be not fully accurate, consider:
```
cqlsh> UPDATE system.config SET value='2' WHERE name='task_ttl_in_seconds';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="option is not live-updateable" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
```
where `task_ttl_in_seconds` has been defined with
`liveness::LiveUpdate`, but because `live_updatable_config_params_changeable_via_cql` is set to
`false` in `scylla.yaml,` `task_ttl_in_seconds` cannot be modified in the
runtime by updating `system.config` virtual table.
Fixes#14355Closes#14382
"experimental" was marked deprecated in 8b917f7c. this change
was included since Scylla 4.6. now that 5.3 has been branched,
this change will be included 5.4. this should be long enough
for the user's turn around if this option is ever used.
the dtests using this option has been audited and updated
accordingly. and the unit testing this option is removed as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.
There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.
This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.
For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Currently, streaming and repair processes and sends data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair additionally could be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference being in how these choose the current compaction time to use:
* Load'n'stream and streaming uses the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to al repair followers. This is to ensure all repair participants work with the exact state of compactness.
Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
Fixes: https://github.com/scylladb/scylladb/issues/3561Closes#14756
* github.com:scylladb/scylladb:
replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
repair/row_level: opt in to compacting the stream
streaming: opt-in to compacting the stream
sstables_loader: opt-in for compacting the stream
replica/table: add optional compacting to make_multishard_streaming_reader()
replica/table: add optional compacting to make_streaming_reader()
db/config: add config item for enabling compaction for streaming and repair
repair: log the error which caused the repair to fail
readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
mutation/mutation_compactor: allow user to abandon current partition
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
Compacting can greatly reduce the amount of data to be processed by
streaming and repair, but with certain data shapes, its effectiveness
can be reduced and its CPU overhead might outweight the benefits. This
should very rarely be the case, but leave an off switch in case
this becomes a problem in a deployment.
Not wired yet.
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
This option allows user to change the number of ranges to stream in
batch per stream plan.
Currently, each stream plan streams 10% of the total ranges.
With more ranges per stream plan, it reduces the waiting time between
two stream plans. For example,
stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)
We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finishes at different time. One of
the shards will be idle.
If we stream more ranges in a single stream plan, the waiting time will
be reduced.
Previously, we retry the stream plan if one of the stream plans is
failed. That's one of the reasons we want more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.
Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.
The default value is the same as before: 10% percentage of total ranges.
Fixes#14191Closes#14402