742 Commits

Author SHA1 Message Date
Pavel Emelyanov
b26a3da584 gossiper: Coroutinize wait_for_gossip_to_settle()
Looks notably shorter this way

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220422093000.24407-1-xemul@scylladb.com>
2022-05-03 15:58:04 +03:00
Pavel Emelyanov
e80adbade3 code: De-globalize gossiper
No code uses the global gossiper instance, so it can be removed. The main
and cql-test-env code now have their own real local instances.

This change also requires adding the debug:: pointer and fixing
scylla-gdb.py to find the correct global location.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
7a0ca3fedc gossiper: Use container() instead of the global pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Solodovnikov
b25c4fee01 gms: gossiper: coroutinize apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:51:18 +03:00
Pavel Solodovnikov
746f1179eb gms: gossiper: coroutinize apply_state_locally_without_listener_notification
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:38:33 +03:00
Pavel Solodovnikov
b7322c3f5d gms: gossiper: coroutinize do_apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:29:26 +03:00
Pavel Solodovnikov
c48dcf607a gms: gossiper: coroutinize apply_new_states
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:28:42 +03:00
Kamil Braun
41f5b7e69e Merge branch 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla into next
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
  main: allow joining raft group0 before waiting for gossiper to settle
  service: raft_group0: make `join_group0` re-entrant
  service: storage_service: add `join_group0` method
  raft_group_registry: update gossiper state only on shard 0
  raft: don't update gossiper state if raft is enabled early or not enabled at all
  gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
  db: system_keyspace: add `bootstrap_needed()` method
  db: system_keyspace: mark getter methods for bootstrap state as "const"
2022-04-14 16:42:20 +02:00
Raphael S. Carvalho
8427ec056c gms: gossiper: don't duplicate knowledge of minimum time for gossip to settle
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220409022435.58070-2-raphaelsc@scylladb.com>
2022-04-11 19:19:02 +03:00
Piotr Sarna
3272b4826f db: add keyspace-storage-options experimental feature
Specifying non-standard keyspace options is experimental, so it's
going to be protected by a configuration flag.
2022-04-08 09:17:01 +02:00
Piotr Sarna
120980ac8e db,gms: add SCYLLA_KEYSPACE schema feature
This schema feature will be used to guard the upcoming
system_schema.scylla_keyspaces schema table.
2022-04-08 09:17:00 +02:00
Piotr Sarna
567c0d0368 db,gms: add KEYSPACE_STORAGE_OPTIONS feature
The feature represents the ability to store storage options
in keyspace metadata: represented as a map of options,
e.g. storage type, bucket, authentication details, etc.
2022-04-08 09:17:00 +02:00
Pavel Solodovnikov
ccb59ba6c7 gms: feature_service: add cluster_uses_raft_mgmt accessor method
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:30:21 +03:00
Pavel Emelyanov
05a32328fc snitch: Remove gossiper_starting()
No longer used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
3da5f6ac30 gossiper: Add system keyspace dependency
The gossiper reads peer features from system keyspace. Also the snitch
code needs system keyspace, and since now it gets all its dependencies
from gossiper (will be fixed some day, but not now), it will do the same
for sys.ks. Thus it's worth having an explicit gossiper->system_keyspace
dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Solodovnikov
011942dcce raft: move tracking SUPPORTS_RAFT_CLUSTER_MANAGEMENT feature to raft
Move the listener from feature service to the `raft_group_registry`.

Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:25 +03:00
Pavel Solodovnikov
7ea4d44508 gms: feature_service: update system.local#supported_features when feature support changes
Also, change the signature of `support()` method to return
`future<>` since it's now a coroutine. Adjust existing call sites.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:21 +03:00
Pavel Emelyanov
6a154305d7 gossiper: Remove db::config reference from gossiper
Also const-ify the db::config reference argument and std::move
the gossip_config argument while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
0c24087007 gossiper: Keep live-updateable options on gossiper
These options need to have updateable_value<> instance referencing
them from gossiper itself. The updateable_value<> is shard-aware in
the sense that it should be constructed on correct shard. This patch
does this -- the db::config reference is carried all the way down
to the gossiper constructor, then each instance gets its shard-local
construction of the updateable_value<>s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
271ceb57b9 gossiper: Keep immutable options on gossip_config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Amnon Heiman
c764f0d0f8 gms/gossiper.cc: Add gauge for live and unreachable nodes
this patch adds two gauges:
scylla_gossip_live - how many live nodes the gossiper sees
scylla_gossip_unreachable - how many nodes the gossiper tries to connect
to but cannot.

Both metrics are reported once per node (not per shard), which gives
visibility into how a specific node sees the cluster.

For example, take a split-brain 6-node cluster (3 and 3). Each node would
report that it sees 2 nodes, but the monitoring system would see that
there are, in fact, 6 nodes.

Example of a two-node cluster, both running:
```
scylla_gossip_live{shard="0"} 1.000000
scylla_gossip_unreachable{shard="0"} 0.000000
```

Example of a two-node cluster, one is down:
```
scylla_gossip_live{shard="0"} 0.000000
scylla_gossip_unreachable{shard="0"} 1.000000
```
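
The split-brain case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `parse_gauges` helper, the node names, and the sample scrape text are hypothetical, and the unreachable count of 3 per node is assumed. Only the two metric names come from the patch.

```python
def parse_gauges(scrape_text):
    """Parse lines like 'scylla_gossip_live{shard="0"} 1.000000' into a dict."""
    gauges = {}
    for line in scrape_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        name = name_and_labels.split("{", 1)[0]  # drop the {shard="0"} labels
        gauges[name] = float(value)
    return gauges


# Hypothetical scrape for each node of a 6-node cluster split 3 and 3:
# every node sees its 2 partition peers as live (assumed: the other 3 unreachable).
node_scrape = (
    'scylla_gossip_live{shard="0"} 2.000000\n'
    'scylla_gossip_unreachable{shard="0"} 3.000000\n'
)
scrapes = {f"node{i}": parse_gauges(node_scrape) for i in range(1, 7)}

nodes_monitored = len(scrapes)  # what the monitoring system itself can count
suspects = []
for node, gauges in scrapes.items():
    seen = int(gauges["scylla_gossip_live"]) + 1  # live peers plus the node itself
    if seen < nodes_monitored:
        suspects.append(node)
        print(f"{node}: sees {seen} of {nodes_monitored} nodes, possible partition")
```

Comparing each node's own view with the number of nodes being scraped is what makes the partition visible; neither number is alarming on its own.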

Fixes #10102

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #10103

[avi: remove whitespace change and correct spelling]
2022-02-20 19:42:58 +02:00
Michael Livshin
d370558279 add "ME_SSTABLE" cluster feature
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Pavel Solodovnikov
dce3159156 gms: gossiper: coroutinize wait_for_gossip
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:34:52 +03:00
Pavel Solodovnikov
ab41151a41 gms: gossiper: coroutinize advertise_token_removed
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:32 +03:00
Pavel Solodovnikov
4416070f56 gms: gossiper: coroutinize advertise_removing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:13 +03:00
Pavel Solodovnikov
e9f5da9507 gms: gossiper: don't wrap convict calls into seastar::async
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:32:14 +03:00
Pavel Solodovnikov
e26829e202 gms: gossiper: coroutinize handle_major_state_change
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
705a759891 gms: gossiper: coroutinize handle_shutdown_msg
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
9ce0e2efa3 gms: gossiper: coroutinize mark_as_shutdown and convict
Since these two functions call each other, convert
to coroutines and eliminate the dependency on `seastar::async`
for both of them at the same time.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c584a9cc1f gms: gossiper: remove comment about requiring thread context in mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
ee30d0a385 gms: gossiper: don't use seastar::async in mark_alive
Since `real_mark_alive` does not require `seastar::async`
now, we can eliminate the wrapping async call, as well.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
529f4d0f98 gms: gossiper: coroutinize do_on_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
37066039df gms: gossiper: coroutinize do_before_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
231d8a3ad4 gms: gossiper: coroutinize real_mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c929f23b8d gms: gossiper: coroutinize mark_dead
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:20 +03:00
Michał Sala
3789a4d02b gms: add PARALLELIZED_AGGREGATION feature
This new feature will be used to determine whether the whole
cluster is ready to parallelize execution of aggregation queries.
2022-02-01 21:14:41 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except for
licenses/README.md.
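
To illustrate what "machine-readable" buys here, below is a minimal sketch of reading the identifier back from a file header. It is not the script used for this change: the `spdx_identifier` helper and the sample header are hypothetical, and only the `SPDX-License-Identifier:` tag and the license expressions named above are taken as given.

```python
import re

# Matches the standard SPDX tag and captures the license expression after it,
# tolerating a trailing "*/" when the tag sits in a one-line block comment.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+?)\s*(?:\*/)?\s*$")


def spdx_identifier(source_text, max_lines=20):
    """Return the SPDX expression found in the first lines of a file, or None."""
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical header of a dual-licensed source file.
sample_header = """/*
 * Copyright (C) 2022-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
 */
"""

print(spdx_identifier(sample_header))
```

A lengthy license blurb would need fuzzy text matching to classify; a single tagged line needs one regular expression.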

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
5dcfb94d5a gms: i_endpoint_state_change_subscriber: make callbacks to return futures
Coroutinize a few simple callbacks in the process.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
b958e85c54 utils: atomic_vector: rename for_each to thread_for_each
To emphasize that the function requires `seastar::thread`
context to function properly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
445876a125 gms: gossiper: coroutinize start_gossiping
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
04b3172e6b gms: gossiper: coroutinize force_remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
a01c900d66 gms: gossiper: coroutinize do_status_check
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
42ff01eee2 gms: gossiper: coroutinize remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
d01e1a774b Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El
The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2*300 CPU seconds wasted.

In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include.

This patch reduces the entire build time (of build/dev/scylla) by 4% - cutting almost 10 CPU minutes (!) from the build.

Closes #9875

* github.com:scylladb/scylla:
  build performance: do not include <seastar/net/ip.hh>
  build performance: speed up inclusion of <gms/inet_address.hh>
2022-01-05 17:55:07 +02:00
Raphael S. Carvalho
426450dc04 treewide: remove useless include of database.hh
Wrote a script based on cpp-include to find places that needlessly
included database.hh, which is expensive to process during
build time.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>
2022-01-05 10:15:19 +02:00
Nadav Har'El
3fbbad7d60 build performance: speed up inclusion of <gms/inet_address.hh>
The header file <gms/inet_address.hh> is included, directly or
indirectly, from 291 source files in Scylla. It is hard to reduce this
number because Scylla relies heavily on IP addresses as keys to
different things. So it is important that this header file be fast to
include. Unfortunately it wasn't... ClangBuildAnalyzer measurements
showed that each inclusion of this header file added a whopping 2 seconds
(in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU
minutes - were spent just on this header file. It was actually worse
because the build also spent additional time on template instantiation
(more on this below).

So in this patch we:

1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid
   including it in one place that doesn't need it. This is just
   cosmetic, and doesn't significantly speed up the build.

2. Move the to_sstring() implementation from the .hh to the .cc. This saves
   a lot of time on template instantiations - previously every source
   file instantiated this to_sstring(), which was slow (that "format"
   thing is slow).

3. Do not include <seastar/net/ip.hh> which is a huge file including
   half the world. All we need from it is the type "ipv4_address",
   so instead include just the new <seastar/net/ipv4_address.hh>.
   This change brings most of the performance improvement.
   Some source files forgot to include various Seastar header files
   because the includes-everything ip.hh did it - so we need to add
   these missing includes in this patch.

After this patch, ClangBuildAnalyzer reports that the cost of
inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326
seconds. Additionally, the 291 instantiations of the format<inet_address>
template - about half a second each - are also gone.

All in all, this patch should save around 10 CPU minutes of build time.
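
As a quick sanity check of the "around 10 CPU minutes" figure, the numbers quoted in this message can be multiplied out. This is a back-of-the-envelope sketch, assuming the per-file costs simply add up across the 291 including files; the variable names are illustrative.

```python
FILES = 291                    # source files including the header (from the message)
BEFORE_S = 2.0                 # per-inclusion cost before the patch, in seconds
AFTER_S = 0.326                # per-inclusion cost after the patch, in seconds
FORMAT_INSTANTIATION_S = 0.5   # "about half a second each", now gone

# Saving from the cheaper header itself.
include_saving = FILES * (BEFORE_S - AFTER_S)
# Saving from no longer instantiating format<inet_address> in every file.
template_saving = FILES * FORMAT_INSTANTIATION_S
total_minutes = (include_saving + template_saving) / 60

print(f"include saving:  {include_saving:.0f} CPU seconds")
print(f"template saving: {template_saving:.0f} CPU seconds")
print(f"total:           {total_minutes:.1f} CPU minutes")
```

Under these assumptions the total comes to roughly 10.5 CPU minutes, consistent with the estimate above.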

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-04 21:07:23 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of keeping
the database consistent to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system and can
be slowed down by higher priority workloads, so there is no
guarantee when a repair can finish. A gc_grace_seconds value that
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the maximum time a write can be delayed before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
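
The four modes can be summarized with a small decision function. This is a simplified sketch, not the code from the patch: the `can_gc` helper, its parameters, and the use of plain integer timestamps are hypothetical. In particular, the 'repair' rule shown here (the tombstone must be older than the last repair of its range minus the propagation delay) is an assumption about how the two options combine.

```python
from enum import Enum


class Mode(Enum):
    TIMEOUT = "timeout"
    DISABLED = "disabled"
    IMMEDIATE = "immediate"
    REPAIR = "repair"


def can_gc(mode, deletion_time, now, *, gc_grace_seconds=0,
           last_repair_time=None, propagation_delay_in_seconds=0):
    """Decide whether a tombstone may be garbage-collected (times in seconds)."""
    if mode is Mode.DISABLED:
        return False                      # never GC a tombstone
    if mode is Mode.IMMEDIATE:
        return True                       # GC a tombstone immediately
    if mode is Mode.TIMEOUT:
        # Default mode: GC once gc_grace_seconds have passed since deletion.
        return deletion_time + gc_grace_seconds <= now
    if mode is Mode.REPAIR:
        # Assumed rule: only tombstones written safely before the last repair
        # of the covering range are eligible; unrepaired ranges keep theirs.
        if last_repair_time is None:
            return False
        return deletion_time < last_repair_time - propagation_delay_in_seconds
    raise ValueError(f"unknown mode: {mode}")


# A tombstone written at t=1000, checked at t=5000.
print(can_gc(Mode.TIMEOUT, 1000, 5000, gc_grace_seconds=10000))
print(can_gc(Mode.REPAIR, 1000, 5000, last_repair_time=4000,
             propagation_delay_in_seconds=600))
```

The point of the 'repair' mode is visible in the last two calls: the decision depends on when the range was repaired, not on how much wall-clock time has passed.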

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Pavel Solodovnikov
904de0a094 gms: introduce two gossip features for raft-based cluster management
The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features.

These features provide a way to organize the automatic
switch to raft-based cluster management.

The scheme is as follows:
 1. Every new node declares support for raft-based cluster ops.
 2. At the moment, no nodes in the cluster can actually use
    raft for cluster management, until the `SUPPORTS*` feature is enabled
    (i.e. understood by every node in the cluster).
 3. After the first `SUPPORTS*` feature is enabled, the nodes
    can declare support for the second, `USES*` feature, which
    means that the node can actually switch to use raft-based cluster
    ops.

The scheme ensures that even if some nodes are down while
transitioning to the new bootstrap mechanism, they can easily
switch to the new procedure without risking disruption to the
cluster.
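
The two-step scheme can be modelled with a toy simulation. This is a sketch under simplifying assumptions, not the gossiper code: the `enabled` and `advertise_uses_if_ready` helpers and the node names are hypothetical, and a feature is treated as enabled exactly when every node in the cluster advertises it.

```python
SUPPORTS = "SUPPORTS_RAFT_CLUSTER_MANAGEMENT"
USES = "USES_RAFT_CLUSTER_MANAGEMENT"


def enabled(feature, nodes):
    """A feature is enabled once every node in the cluster advertises it."""
    return all(feature in advertised for advertised in nodes.values())


def advertise_uses_if_ready(nodes):
    """Step 3: nodes declare USES only after SUPPORTS is enabled cluster-wide."""
    if enabled(SUPPORTS, nodes):
        for advertised in nodes.values():
            advertised.add(USES)


# Mixed cluster: n3 has not been upgraded yet, so it does not advertise SUPPORTS.
nodes = {"n1": {SUPPORTS}, "n2": {SUPPORTS}, "n3": set()}
advertise_uses_if_ready(nodes)
raft_in_use_before = enabled(USES, nodes)

# After n3 is upgraded, SUPPORTS becomes enabled and USES can follow.
nodes["n3"].add(SUPPORTS)
advertise_uses_if_ready(nodes)
raft_in_use_after = enabled(USES, nodes)

print(raft_in_use_before, raft_in_use_after)
```

Gating the second feature on the first means no node starts using raft-based operations while some node in the cluster still cannot understand them.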

The features are not actually wired to anything yet,
providing a framework for the integration with `raft_group0`
code, which is the subject of a follow-up series.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>
2021-12-30 11:05:45 +02:00