In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.
When we start the hint managers, one of the
things that happen is creating endpoint
managers -- structures managed by
db::hints::manager. Whether we create
an instance of endpoint manager depends on
the value returned by host_filter::can_hint_for,
which, in turn, may depend on the current state
of locator::topology.
If locator::topology is incomplete, some endpoint
managers may not be started even though they
should (because the target node IS part of the
cluster and we SHOULD send hints to it if there
are some).
A situation like that can happen because we
start the hint managers too early. This commit
aims to solve that problem. We only start
the hint managers when we've gathered information
about the other nodes in the cluster and created
the locator::topology using it.
Hinted Handoff is not negatively affected by these
changes since in between the previous point of
starting the hint managers and the current one,
all of the mutations performed by
service::storage_proxy target the local node, so
no hints would need to be generated anyway.
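To illustrate the dependency described above, here is a toy model in
Python. It is not the real db::hints code: the Topology class, the
DC-based filter and start_endpoint_managers() are hypothetical stand-ins,
assuming a filter that only allows hints for nodes in selected datacenters.

```python
from typing import Dict, List, Optional, Set


class Topology:
    """Toy stand-in for locator::topology: maps a node to its datacenter."""

    def __init__(self) -> None:
        self._dc_of: Dict[str, str] = {}

    def add_node(self, node: str, dc: str) -> None:
        self._dc_of[node] = dc

    def datacenter_of(self, node: str) -> Optional[str]:
        # None when the node is not known to the topology yet.
        return self._dc_of.get(node)


def can_hint_for(topology: Topology, allowed_dcs: Set[str], node: str) -> bool:
    """Hypothetical DC-based filter: hint only for nodes in allowed DCs."""
    return topology.datacenter_of(node) in allowed_dcs


def start_endpoint_managers(topology: Topology, allowed_dcs: Set[str],
                            candidates: List[str]) -> List[str]:
    """Return the nodes for which an endpoint manager would be created."""
    return [n for n in candidates if can_hint_for(topology, allowed_dcs, n)]
```

With an empty topology the filter rejects a node that does belong to an
allowed datacenter; once the node is registered, the same call accepts it.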
Fixes scylladb/scylladb#11870
Closes scylladb/scylladb#16511
In this mode, the node is not reachable from the outside, i.e.
* it refuses all incoming RPC connections,
* it does not join the cluster, thus
* all group0 operations are disabled (e.g. schema changes),
* all cluster-wide operations are disabled for this node (e.g. repair),
* other nodes see this node as dead,
* it cannot read or write data from/to other nodes,
* it does not open Alternator and Redis transport ports and the TCP CQL port.
The only way to make CQL queries is to use the maintenance socket. The node serves only local data.
To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file.
REST API works as usual, but some routes are disabled:
* authorization_cache
* failure_detector
* hinted_hand_off_manager
This PR also updates the maintenance socket documentation:
* add cqlsh usage to the documentation
* update the documentation to use `WhiteListRoundRobinPolicy`
Fixes #5489.
Closes scylladb/scylladb#15346
* github.com:scylladb/scylladb:
test.py: add test for maintenance mode
test.py: generalize usage of cluster_con
test.py: when connecting to node in maintenance mode use maintenance socket
docs: add maintenance mode documentation
main: add maintenance mode
main: move some REST routes initialization before joining group0
message_service: add sanity check that rpc connections are not created in the maintenance mode
raft_group0_client: disable group0 operations in the maintenance mode
service/storage_service: add start_maintenance_mode() method
storage_service: add MAINTENANCE option to mode enum
service/maintenance_mode: add maintenance_mode_enabled bool class
service/maintenance_mode: move maintenance_socket_enabled definition to separate file
db/config: add maintenance mode flag
docs: add cqlsh usage to maintenance socket documentation
docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.
This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.
Fixes #16752
Closes scylladb/scylladb#16888
* github.com:scylladb/scylladb:
test: test_tablets: add a test for cleanup after migration
test: pylib: add ScyllaCluster.wipe_sstables
test: boost: add commitlog_cleanup_test
db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
replica: table: garbage-collect irrelevant system.commitlog_cleanups records
db: commitlog: add min_position()
replica: table: populate system.commitlog_cleanups on tablet cleanup
db: system_keyspace: add system.commitlog_cleanups
replica: table: refresh compound sstable set after tablet cleanup
We add a sanity check to ensure at most one transitioning node at
a time. If there are more, something must have gone wrong.
In the future, we might implement concurrent topology operations.
Then, we will remove this sanity check.
We also extend the comment describing `transition_nodes` so that
it better explains why we use a map and how it should be handled.
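A minimal sketch of such a check, in Python rather than the actual C++.
The function name and the error text are made up for illustration; only
the "at most one entry in the map" rule comes from this commit.

```python
from typing import Dict


def check_transition_nodes(transition_nodes: Dict[str, str]) -> None:
    """Raise if more than one node is transitioning at the same time.

    transition_nodes maps a node id to its (hypothetical) transition state.
    """
    if len(transition_nodes) > 1:
        raise RuntimeError(
            f"expected at most one transitioning node, "
            f"found {len(transition_nodes)}: {sorted(transition_nodes)}"
        )
```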
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for db::schema_tables::table_kind,
and its operator<<() is still used by the homebrew generic formatter
for std::map<>, so it is preserved.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16972
Native histograms (also known as sparse histograms) are an experimental Prometheus feature.
They use protobuf as the reporting layer.
Native histograms hold the benefits of high resolution at a lower resource cost.
This series allows sending histograms in a native histogram format over protobuf.
By default, protobuf support is disabled. To use protobuf with native histograms, the command line flag prometheus_allow_protobuf should be set to true, and the Prometheus server should send the accept header with protobuf.
Fixes #12931
Closes scylladb/scylladb#16737
* github.com:scylladb/scylladb:
main.cc: Add prometheus_allow_protobuf command line
histogram_metrics_helper: support native histogram
config: Add prometheus_allow_protobuf flag
To avoid data resurrection, mutations deleted by cleanup operations
have to be skipped during commitlog replay.
This patch implements this, based on the metadata recorded on cleanup
operations into system.commitlog_cleanups.
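The idea can be sketched as follows. This is a simplified Python model,
not the replayer code: the record layout (table, token range, replay
position) and the integer positions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CleanupRecord:
    """Hypothetical shape of one row describing a cleanup operation."""
    table: str
    first_token: int
    last_token: int
    position: int  # writes at or before this position were cleaned up


@dataclass(frozen=True)
class Mutation:
    table: str
    token: int
    position: int  # replay position of the mutation in the commitlog


def should_skip(m: Mutation, records: List[CleanupRecord]) -> bool:
    """True if a cleanup covering this token happened after the write."""
    return any(
        r.table == m.table
        and r.first_token <= m.token <= r.last_token
        and m.position <= r.position
        for r in records
    )


def replay(mutations: List[Mutation],
           records: List[CleanupRecord]) -> List[Mutation]:
    """Keep only the mutations that were not removed by a cleanup."""
    return [m for m in mutations if not should_skip(m, records)]
```

A mutation written after the cleanup, or outside the cleaned token range,
is still replayed; only the ones the cleanup already removed are dropped.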
Add a helper function which returns the minimum replay position
across all existing or future commitlog segments.
Only positions greater or equal to it can be replayed on the next reboot.
We will use this helper in a future patch to garbage collect some cleanup
metadata which refers to replay positions.
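A rough sketch of how such a helper could drive garbage collection,
assuming replay positions are plain integers (the real ones are richer)
and using made-up function names:

```python
from typing import Iterable, List


def min_position(segment_start_positions: Iterable[int],
                 next_position: int) -> int:
    """Smallest position any existing or future segment can contain.

    With no existing segments, the bound is the next position to be written.
    """
    return min(segment_start_positions, default=next_position)


def garbage_collect(record_positions: List[int], min_pos: int) -> List[int]:
    """Drop cleanup records that can no longer match a replayable write.

    Only positions >= min_pos can be replayed, so a record that refers to
    an older position is irrelevant and can be removed.
    """
    return [p for p in record_positions if p >= min_pos]
```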
There are currently two options how to "request" the number of initial tablets for a table:
1. specify it explicitly when creating a keyspace
2. let scylla calculate it on its own
Both are not very nice. The former doesn't take cluster layout into consideration. The latter does, but starts with one tablet per shard, which can be too low if the amount of data grows rapidly.
Here's a (maybe temporary) proposal to facilitate at least perf tests -- the --tablets-initial-scale-factor option that enhances option number two above by multiplying the calculated number of tablets by the configured number. This is what we currently do to run perf tests by patching scylla; with the option it is going to be more convenient.
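As a back-of-the-envelope illustration in Python (the function name is
invented, and the real allocator may apply further adjustments that are
not modelled here), the effect of the factor is:

```python
def initial_tablets(total_shards: int, scale_factor: int = 1) -> int:
    """One tablet per shard, multiplied by the configured scale factor."""
    if total_shards < 1 or scale_factor < 1:
        raise ValueError("total_shards and scale_factor must be positive")
    return total_shards * scale_factor
```

For example, a cluster with 3 nodes of 8 shards each would start with 24
tablets by default, and with 240 if the factor is set to 10.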
Closes scylladb/scylladb#16919
* github.com:scylladb/scylladb:
config: Add --tablets-initial-scale-factor
tablet_allocator: Add initial tablets scale to config
tablet_allocator: Add config
Native histograms (also known as sparse histograms) are an experimental
Prometheus feature. They use protobuf as the reporting layer. The
prometheus_allow_protobuf flag allows the user to enable protobuf
protocol. When this flag is set to true, and the Prometheus server sends
in the request that it accepts protobuf, the result will be in protobuf
protocol.
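The decision can be sketched like this. It is a simplified Python model
of the negotiation, not the server code; choose_format() is a made-up
name, and the Accept header parsing ignores quality values.

```python
PROTOBUF_MIME = "application/vnd.google.protobuf"


def choose_format(allow_protobuf: bool, accept_header: str) -> str:
    """Return "protobuf" only if the flag is on and the client accepts it."""
    accepted = {
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    }
    if allow_protobuf and PROTOBUF_MIME in accepted:
        return "protobuf"
    return "text"
```

With the flag off, the reply stays in the text format even if the
Prometheus server advertises protobuf support.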
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Previous patch taught tablets allocator to multiply the initial tablets
count by some value. This patch makes this factor configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This enhancement formats descriptions in config.cc using the standard markup language reStructuredText (RST).
By doing so, it improves the rendering of these descriptions in the documentation, allowing you to use various directives like admonitions, code blocks, ordered lists, and more.
Closes scylladb/scylladb#16311
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for db::operation_type, and
remove its operator<<().
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16832
Service level controller updates itself at a fixed interval. However, the interval is hardcoded in main to 10 seconds, which leads to long sleeps in some of the tests.
This patch moves this value to `service_levels_interval_ms` command line option and sets this value to 0.5s in cql-pytest.
Closes scylladb/scylladb#16394
* github.com:scylladb/scylladb:
test:cql-pytest: change service levels intervals in tests
configure service levels interval
Refs #16757
Allows waiting for all previous and pending segment deletes to finish.
Useful if a caller of `discard_completed_segments` (i.e. a memtable
flush target) not only wants to ensure segments are clean and released,
but thoroughly deleted/recycled, and hence no threat of resurrecting
data on crash+restart.
Test included.
Closes scylladb/scylladb#16801
Currently, to figure out if a topology request is complete, a submitter
checks the topology state and tries to infer the status of the request
from it. This is not exact. Let's look at rebuild handling, for
instance. To figure out if a request is completed, the code waits for
the request object to disappear from the topology, but if another rebuild
starts between the end of the previous one and the code noticing that
it completed, the code will continue waiting for the next rebuild.
Another problem is that in case of operation failure there is no way to
pass an error back to the initiator.
This series solves those problems by assigning an id to each request and
tracking the status of each request in a separate table. The initiator
can query the request status from the table and see if the request was
completed successfully or if it failed with an error, which is also
available from the table.
The schema for the table is:
    CREATE TABLE system.topology_requests (
        id timeuuid PRIMARY KEY,
        initiating_host uuid,
        start_time timestamp,
        done boolean,
        error text,
        end_time timestamp,
    );
and all entries have TTL of one month.
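To show why a per-request id removes the ambiguity, here is a small
in-memory model in Python. It only mirrors the "done" and "error" columns
of the table above; the class and method names are hypothetical and the
raft-managed storage is replaced by a dict.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class RequestEntry:
    initiating_host: str
    done: bool = False
    error: str = ""


class TopologyRequests:
    """Toy stand-in for the system.topology_requests table."""

    def __init__(self) -> None:
        self._rows: Dict[uuid.UUID, RequestEntry] = {}

    def submit(self, initiating_host: str) -> uuid.UUID:
        request_id = uuid.uuid1()  # time-based, like a CQL timeuuid
        self._rows[request_id] = RequestEntry(initiating_host)
        return request_id

    def complete(self, request_id: uuid.UUID, error: str = "") -> None:
        row = self._rows[request_id]
        row.done = True
        row.error = error

    def status(self, request_id: uuid.UUID) -> Tuple[bool, str]:
        """Return (done, error) for exactly this request."""
        row = self._rows[request_id]
        return row.done, row.error
```

Two back-to-back rebuilds get different ids, so the initiator of the
first one sees it as done even while the second one is still running.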
Instead of trying to guess if a request completed by looking into the
topology state (which can sometimes be error-prone), look at the
request status in the new topology_requests table. If the request failed,
report the reason for the failure from the table.
The goal of this PR is to fix Scylla so that the dtest test_mvs_populating_from_existing_data, which starts to fail when enabling tablets, will pass.
The main fix (the second patch) is reverting code which doesn't work with tablets, and I explain why I think this code was not necessary in the first place.
Fixes #16598
Closes scylladb/scylladb#16670
* github.com:scylladb/scylladb:
view: revert cleanup filter that doesn't work with tablets
mv: sleep a bit before view-update-generator restart
Provide a unique ID for each topology request and store it in the topology
state machine. It will be used to index the new topology requests table in
order to retrieve request status.
The table has the following schema and will be managed by raft:
    CREATE TABLE system.topology_requests (
        id timeuuid PRIMARY KEY,
        initiating_host uuid,
        start_time timestamp,
        done boolean,
        error text,
        end_time timestamp,
    );
In case of a request completing with an error, the "error" field will be non-empty when "done" is set to true.
The patch adds cleanup state to the persistent and in-memory state and
handles the loading. The state can be "clean", which means no cleanup is
needed; "needed", which means the node is dirty and needs to run cleanup
at some point; or "running", which means that the node is running cleanup
right now, and when it completes the state will be reset to "clean".
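The three states can be modelled as a small enum. The Python sketch below
is illustrative only: the state names come from this patch, while the
transition table is an assumption about the intended order
(clean -> needed -> running -> clean), not the actual implementation.

```python
from enum import Enum


class CleanupState(Enum):
    CLEAN = "clean"      # no cleanup needed
    NEEDED = "needed"    # node is dirty, cleanup must run at some point
    RUNNING = "running"  # cleanup is in progress on this node


# Assumed transitions, for illustration only.
ALLOWED = {
    CleanupState.CLEAN: {CleanupState.NEEDED},
    CleanupState.NEEDED: {CleanupState.RUNNING},
    CleanupState.RUNNING: {CleanupState.CLEAN},
}


def load_state(persisted: str) -> CleanupState:
    """Turn the persisted string back into the in-memory state."""
    return CleanupState(persisted)


def advance(current: CleanupState, target: CleanupState) -> CleanupState:
    """Move to the target state, rejecting transitions not listed above."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```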
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,
1. I believe this filter was never necessary, because the view update
code already silently ignores base updates which do not belong to
this replica (see get_view_natural_endpoint()). After all, the view
update needs to know that this replica is the Nth owner of the base
update to send its update to the Nth view replica, but if no such
N exists, no view update is sent.
2. The code introduced for that filter used a per-keyspace replication
map, which was ok for vnodes but no longer works for tablets, and
causes the operation using it to fail.
3. The filter was used every time the "view update generator" was used,
regardless of whether any cleanup is necessary or not, so every
such operation would fail with tablets. So for example the dtest
test_mvs_populating_from_existing_data fails with tablets:
* This test has view building in parallel with automatic tablet
movement.
* Tablet movement is streaming.
* When streaming happens before view building has finished, the
streamed sstables get "view update generator" run on them.
This causes the problematic code to be called.
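The pairing rule from item 1 can be sketched in a few lines of Python.
This is a deliberately simplified model, assuming plain ordered replica
lists; the real get_view_natural_endpoint() takes more into account.

```python
from typing import List, Optional


def view_natural_endpoint(me: str, base_replicas: List[str],
                          view_replicas: List[str]) -> Optional[str]:
    """Pair the Nth base replica with the Nth view replica.

    Returns None when this node does not own the base update (no such N),
    in which case no view update is sent.
    """
    if me not in base_replicas:
        return None
    n = base_replicas.index(me)
    if n >= len(view_replicas):
        return None
    return view_replicas[n]
```

A node that is not a base replica for the partition gets None, which is
why an extra ownership filter in front of this step is redundant.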
Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.
Fixes #16598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "view update generator" is responsible for generating view updates
for staging sstables (such as coming from repair). If the processing
fails, the code retries - immediately. If there is some persistent bug,
such as issue #16598, we will have a tight loop of error messages,
potentially a gigabyte of identical messages every second.
In this patch we simply add a sleep of one second after view update
generation fails before retrying. We can still get many identical
error messages if there is some bug, but not more than one per second.
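The resulting loop has roughly this shape. It is a generic Python sketch
with invented names, not the generator code; the sleep function is
injectable only so that the example can be exercised without waiting.

```python
import time
from typing import Callable


def run_until_success(step: Callable[[], None],
                      retry_delay_s: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call step() until it succeeds, sleeping after every failure.

    Returns the number of failed attempts before the successful one.
    """
    failures = 0
    while True:
        try:
            step()
            return failures
        except Exception as exc:
            failures += 1
            print(f"view update generation failed: {exc}")
            # Without this sleep a persistent bug would spin in a tight loop.
            sleep(retry_delay_s)
```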
Refs #16598.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far the service levels interval, responsible for updating SL configuration,
was hardcoded in main.
Now it's extracted to `service_levels_interval_ms` option.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we
* define a formatter for `db::consistency_level`
* drop its `operator<<`, as it is not used anymore
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16755
before this change, we default to the "mc" sstable format, and
switch to "md" if the cluster agrees on using it, and to
"me" if the cluster agrees on using this. the cluster feature
is used to get the consensus across the members in the cluster,
if any of the existing nodes in the cluster has its `sstable_format`
configured to, for instance, "mc", then the cluster is stuck with
"mc".
but we disabled "mc" sstable format back in 3d345609, the first LTS
release including that change was scylla v5.2.0. which means, the
cluster of the last major version Scylla should be using "md" or
"me". per our document on upgrade, see docs/upgrade/index.rst,
> You should perform the upgrades consecutively - to each
> successive X.Y version, without skipping any major or minor version.
>
> Before you upgrade to the next version, the whole cluster (each
> node) must be upgraded to the previous version.
we can assume that, a 6.x node will only join a cluster
with 5.x or 6.x nodes. (joining a 7.x cluster should work, but
this is not relevant to this change). in both cases, since
5.x and up scylla can only be configured with "md" `sstable_format`,
there is no need to switch from "mc" to "md" anymore. so we can
ditch the code supporting it.
Refs #16551
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
seastar::logger is using the compile-time format checking by default if
compiled using {fmt} 8.0 and up. and it requires the format string to be
consteval string, otherwise we have to use `fmt::runtime()` explicitly.
so, to adapt to the change, let's use the consteval string when formatting
logging messages.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16612
Define struct peer_info holding optional values
for all system.peers columns, allowing the caller to
update any column.
Pass the values as std::vector<std::optional<data_value>>
to query_processor::execute_internal.
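The "optional per column" idea looks roughly like this in Python. It is
only a model: the column subset is an illustrative selection, None plays
the role of an unset value, and apply_update() stands in for the effect
of the statement on a row.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class PeerInfo:
    """Illustrative subset of columns; every one is optional."""
    data_center: Optional[str] = None
    rack: Optional[str] = None
    host_id: Optional[str] = None
    release_version: Optional[str] = None


COLUMNS = [f.name for f in fields(PeerInfo)]


def to_values(info: PeerInfo) -> List[Optional[Any]]:
    """One slot per column, in a fixed order; None means "leave as is"."""
    return [getattr(info, c) for c in COLUMNS]


def apply_update(row: Dict[str, Any],
                 values: List[Optional[Any]]) -> Dict[str, Any]:
    """Overwrite only the columns whose value is engaged."""
    updated = dict(row)
    for col, val in zip(COLUMNS, values):
        if val is not None:
            updated[col] = val
    return updated
```

A caller that only knows the rack fills in that one field, and the other
columns of the existing row are left untouched.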
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
data_value_list is a wrapper around std::initializer_list<data_value>.
Use it for passing values to `cql3::query_processor::execute_internal`
and friends.
A following patch will add a std::variant for data_value_or_unset
and extend data_value_list to support unset values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
system_keyspace had a hack to skip update_peer_info
for the local node, and then to remove an entry for
the local node in system.peers if `update_tokens(endpoint, ...)`
was called for this node.
This change unhacks system_keyspace by considering
update of system.peers with the local address as
an internal error and fixing the call sites that do that.
Fixes #16425
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This small series improves two things in the multi-node tests for tablet support in materialized views:
1. The test for Alternator LSI, which "sometimes" could reproduce the bug by creating a 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)).
2. Simplify the tablet lookup code in the test to not go through a "table id", and look up the table's (or view's) name directly (requires a full-table scan of the tablets table, but that's entirely reasonable in a test).
The third patch in this series also fixes a comment typo discovered in a previous review.
Closes scylladb/scylladb#16440
* github.com:scylladb/scylladb:
materialized views: fix typo in comment
test_mv_tablets: simplify lookup of tablets
alternator, tablets: improve Alternator LSI tablets test
The create_keyspace_from_schema_partition code creates ks metadata
without schemas and user-types. There's new_keyspace() convenience
helper for such cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Almost all callers call new_keyspace with durable writes ON, so it's
worth having a default value for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The option is kept in DDL, but is _not_ stored in
system_schema.keyspaces. Instead, it's removed from the provided options
and kept in scylla_keyspaces table in its own column. All the places
that had optional initial_tablets disengaged now set this value up the
way they find appropriate.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays reading scylla-specific info from schema happens under
respective schema feature. However (at least in raft case) when a new
node joins the cluster merging schema for the first time may happen
_before_ features are merged and enabled. Thus merging schema can go the
wrong way by erroneously skipping the scylla-specific info.
On the other hand, if system_schema.scylla_keyspaces is there it's
there, there's no reason _not_ to pick this data up in that case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The object in question fully describes the keyspace to be created and,
among other things, contains replication strategy options. Next patches
move the "initial_tablets" option out of those options and keep it
separately, so the ks metadata should also carry this option separately.
This patch is _just_ extending the metadata creation API, in fact the
new field is unused (write-only) so all the places that need to provide
this data keep it disengaged and are explicitly marked with FIXME
comment. Next patches will fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
b815aa021c added a yield before
the trace point, causing the moved `frozen_mutation_and_schema`
(and `inet_address_vector_topology_change`) to drop out of scope
and be destroyed, as the rvalue-referenced objects aren't moved
onto the coroutine frame.
This change passes them by value rather than by rvalue-reference
so they will be stored in the coroutine frame.
Fixes #16540
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#16541