Commit Graph

725 Commits

Author SHA1 Message Date
Pavel Emelyanov
6a154305d7 gossiper: Remove db::config reference from gossiper
Also const-ify the db::config reference argument and std::move
the gossip_config argument while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
0c24087007 gossiper: Keep live-updateable options on gossiper
These options need to have updateable_value<> instance referencing
them from gossiper itself. The updateable_value<> is shard-aware in
the sense that it should be constructed on correct shard. This patch
does this -- the db::config reference is carried all the way down
to the gossiper constructor, then each instance gets its shard-local
construction of the updateable_value<>s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
271ceb57b9 gossiper: Keep immutable options on gossip_config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Amnon Heiman
c764f0d0f8 gms/gossiper.cc: Add gauge for live and unreachable nodes
this patch adds two gauges:
scylla_gossip_live - how many live nodes the gossiper sees
scylla_gossip_unreachable - how many nodes the gossiper tries to connect
to but cannot.

Both metrics are reported once per node (i.e., per node, not per shard) it
gives visibility to how a specific node sees the cluster.

For example, a split-brain 6 nodes cluster (3 and 3). Each node would
report that it sees 2 nodes, but the monitoring system would see that
there are, in fact, 6 nodes.

Example of two nodes cluster, both running:
``
scylla_gossip_live{shard="0"} 1.000000
scylla_gossip_unreachable{shard="0"} 0.000000
``

Example of two nodes cluster, one is down:
``
scylla_gossip_live{shard="0"} 0.000000
scylla_gossip_unreachable{shard="0"} 1.000000
``

Fixes #10102

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #10103

[avi: remove whitespace change and correct spelling]
2022-02-20 19:42:58 +02:00
Michael Livshin
d370558279 add "ME_SSTABLE" cluster feature
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Pavel Solodovnikov
dce3159156 gms: gossiper: coroutinize wait_for_gossip
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:34:52 +03:00
Pavel Solodovnikov
ab41151a41 gms: gossiper: coroutinize advertise_token_removed
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:32 +03:00
Pavel Solodovnikov
4416070f56 gms: gossiper: coroutinize advertise_removing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:13 +03:00
Pavel Solodovnikov
e9f5da9507 gms: gossiper: don't wrap convict calls into seastar::async
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:32:14 +03:00
Pavel Solodovnikov
e26829e202 gms: gossiper: coroutinize handle_major_state_change
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
705a759891 gms: gossiper: coroutinize handle_shutdown_msg
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
9ce0e2efa3 gms: gossiper: coroutinize mark_as_shutdown and convict
Since these two functions call each other, convert
to coroutines and eliminate the dependency on `seastar::async`
for both of them at the same time.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c584a9cc1f gms: gossiper: remove comment about requiring thread context in mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
ee30d0a385 gms: gossiper: don't use seastar::async in mark_alive
Since `real_mark_alive` does not require `seastar::async`
now, we can eliminate the wrapping async call, as well.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
529f4d0f98 gms: gossiper: coroutinize do_on_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
37066039df gms: gossiper: coroutinize do_before_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
231d8a3ad4 gms: gossiper: coroutinize real_mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c929f23b8d gms: gossiper: coroutinize mark_dead
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:20 +03:00
Michał Sala
3789a4d02b gms: add PARALLELIZED_AGGREGATION feature
This new feature will be used to determined whether the whole
cluster is ready to parallelize execution of aggregation queries.
2022-02-01 21:14:41 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
5dcfb94d5a gms: i_endpoint_state_change_subscriber: make callbacks to return futures
Coroutinize a few simple callbacks in the process.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
b958e85c54 utils: atomic_vector: rename for_each to thread_for_each
To emphasize that the function requires `seastar::thread`
context to function properly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
445876a125 gms: gossiper: coroutinize start_gossiping
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
04b3172e6b gms: gossiper: coroutinize force_remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
a01c900d66 gms: gossiper: coroutinize do_status_check
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
42ff01eee2 gms: gossiper: coroutinize remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
d01e1a774b Merge 'Build performance: do not include the entire <seastar/net/ip.hh>' from Nadav Har'El
The header file <seastar/net/ip.hh> is a large collection of unrelated stuff, and according to ClangBuildAnalyzer, takes 2 seconds to compile for every source file that included it - and unfortunately virtually all Scylla source files included it - through either "types.hh" or "gms/inet_address.hh". That's 2*300 CPU seconds wasted.

In this two-patch series we completely eliminate the inclusion of <seastar/net/ip.hh> from Scylla. We still need the ipv4_address, ipv6_address types (e.g., gms/inet_address.hh uses it to hold a node's IP address) so those were split (in a Seastar patch that is already in) from ip.hh into separate small header files that we can include.

This patch reduces the entire build time (of build/dev/scylla) by 4% - reducing almost 10 sCPU minutes (!) from the build.

Closes #9875

* github.com:scylladb/scylla:
  build performance: do not include <seastar/net/ip.hh>
  build performance: speed up inclusion of <gm/inet_address.hh>
2022-01-05 17:55:07 +02:00
Raphael S. Carvalho
426450dc04 treewide: remove useless include of database.hh
Wrote a script based on cpp-include to find places that needlessly
included database.hh, which is expensive to process during
build time.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220104204359.168895-1-raphaelsc@scylladb.com>
2022-01-05 10:15:19 +02:00
Nadav Har'El
3fbbad7d60 build performance: speed up inclusion of <gm/inet_address.hh>
The header file <gm/inet_address.hh> is included, directly or
indirectly, from 291 source files in Scylla. It is hard to reduce this
number because Scylla relies heavily on IP addresses as keys to
different things. So it is important that this header file be fast to
include. Unfortunately it wasn't... ClangBuildAnalyzer measurements
showed that each inclusion of this header file added a whopping 2 seconds
(in dev build mode) to the build. A total of 600 CPU seconds - 10 CPU
minutes - were spent just on this header file. It was actually worse
because the build also spent additional time on template instantiation
(more on this below).

So in this patch we:

1. Remove some unnecessary stuff from gms/inet_address.hh, and avoid
   including it in one place that doesn't need it. This is just
   cosmetic, and doesn't significantly speed up the build.

2. Move the to_sstring() implementation for the .hh to .cc. This saves
   a lot of time on template instantiations - previously every source
   file instantiated this to_sstring(), which was slow (that "format"
   thing is slow).

3. Do not include <seastar/net/ip.hh> which is a huge file including
   half the world. All we need from it is the type "ipv4_address",
   so instead include just the new <seastar/net/ipv4_address.hh>.
   This change brings most of the performance improvement.
   So source files forgot to include various Seastar header files
   because the includes-everything ip.hh did it - so we need to add
   these missing includes in this patch.

After this patch, ClangBuildAnalyzer's reports that the cost of
inclusion of <gms/inet_address.hh> is down from 2 seconds to 0.326
seconds. Additionally the format<inet_address> template instantiation
291 times - about half a second each - is also gone.

All in all, this patch should reduce around 10 CPU minutes from the build.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-04 21:07:23 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Pavel Solodovnikov
904de0a094 gms: introduce two gossip features for raft-based cluster management
The patch adds the `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
and `USES_RAFT_CLUSTER_MANAGEMENT` gossiper features.

These features provide a way to organize the automatic
switch to raft-based cluster management.

The scheme is as follows:
 1. Every new node declares support for raft-based cluster ops.
 2. At the moment, no nodes in the cluster can actually use
    raft for cluster management, until the `SUPPORTS*` feature is enabled
    (i.e. understood by every node in the cluster).
 3. After the first `SUPPORTS*` feature is enabled, the nodes
    can declare support for the second, `USES*` feature, which
    means that the node can actually switch to use raft-based cluster
    ops.

The scheme ensures that even if some nodes are down while
transitioning to new bootstrap mechanism, they can easily
switch to the new procedure, not risking to disrupt the
cluster.

The features are not actually wired to anything yet,
providing a framework for the integration with `raft_group0`
code, which is subject for a follow-up series.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211220081318.274315-1-pa.solodovnikov@scylladb.com>
2021-12-30 11:05:45 +02:00
Avi Kivity
4a323772c1 Merge 'Use the same page size limit in reverse queries as in forward reads' from Piotr Jastrzębski
The default for get_unlimited_query_max_result_size() is 100MB (adjustable through config), whereas query::result_memory_limiter::maximum_result_size is 1MB (hard coded, should be enough for everybody)

This limit is then used by the replica to decide when to break pages and, in case of reversed clustering order reads, when to fail the read when accumulated data crosses the threshold. The latter behavior stems from the fact that reversed reads had to accumulate all the data (read in forward order) before they can reverse it and return the result. Reverse reads thus need a higher limit so that they have a higher chance of succeeding.

Most readers are now supporting reading in reverse natively, and only reversing wrappers (make_reversing_reader()) inserted on top of ka/la sstable readers need to accumulate all the data. In other cases, we could break pages sooner. This should lead to better stability (less memory usage) and performance (lower page build latency, higher read concurrency due to less memory footprint).

Tests: unit(dev)

Closes #9815

* github.com:scylladb/scylla:
  storage_proxy: Send page_size in the read_command
  gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
  result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
  max_result_size: Add page_size field
2021-12-29 15:04:01 +02:00
Avi Kivity
49a603af39 gossip: fix lowres_clock::duration assumption
The variable diff is assigned a type of std::chrono::milliseconds
but later used to store the difference between two
lowres_clock::time_point samples. This works now because the two
types are the same, but fails if lowres_clock::duration changes.

Remove the assumption by using lowres_clock::duration.
2021-12-28 21:13:59 +02:00
Piotr Jastrzebski
02d5997377 gms: add SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT feature
This new feature will be used to determined whether the whole cluster
is ready to use additional page_size field in max_result_size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:38:02 +01:00
Nadav Har'El
9ae98dbe92 Merge 'Reduce boot time for dtest setup' from Asias He
This patch helps to speed up node boot up for test setups like dtest.

Nadav reported

```
With Asias's two patches o Scylla, and my patch to enable it in dtest:

Boot time of 5 nodes is now down to 9 seconds!

Remember we started this exercise with 214 seconds? :-)
```

Closes #9808

* github.com:scylladb/scylla:
  storage_service: Recheck tokens before throw in storage_service::bootstrap
  gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero
2021-12-16 13:44:42 +02:00
Avi Kivity
87917d2536 Merge "gms: gossiper: coroutinize a few small functions" from Pavel S
"
Start converting small functions in gossiper code
from using `seastar::thread` context to coroutines.

For now, the changes are quite trivial.
Later, larger code fragments will be converted
to eliminate uses of `seastar::async` function calls.

Moving the code to coroutines makes the code a bit
more readable and also mmediately evident that a given
function is async just looking at the signature (for
example, for void-returning functions, a coroutine
will return `future<>` instead of `void` in case of
a seastar::thread-using function).

Tests: unit(dev)
"

* 'coro_gossip_v1' of https://github.com/ManManson/scylla:
  gms: gossiper: coroutinize `maybe_enable_features`
  gms: gossiper: coroutinize `wait_alive`
  gms: gossiper: coroutinize `add_saved_endpoint`
  gms: gossiper: coroutinize `evict_from_membership`
2021-12-15 16:02:18 +02:00
Asias He
78d0cc4ab5 gossip: Dot not wait for gossip to settle if skip_wait_for_gossip_to_settle is zero
The skip_wait_for_gossip_to_settle == 0 which means do not wait for
gossip to settle at all. It is not respected in
gossiper::wait_for_range_setup and in gossiper::wait_for_gossip for
initial sleeps.

Since setting skip_wait_for_gossip_to_settle zero is not allowed in
production cluster anyway. It is mostly used by tests like dtest to
reduce the cluster boot up time. Respect skip_wait_for_gossip_to_settle
zero flag and avoid any sleep and wait completely.
2021-12-15 19:40:43 +08:00
Pavel Solodovnikov
47533bca65 gms: gossiper: coroutinize maybe_enable_features
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:39:48 +03:00
Pavel Solodovnikov
3993c6a9fb gms: gossiper: coroutinize wait_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:30:32 +03:00
Pavel Solodovnikov
a6ff04dd24 gms: gossiper: coroutinize add_saved_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:23:35 +03:00
Pavel Solodovnikov
23dd8b66c5 gms: gossiper: coroutinize evict_from_membership
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-12-11 09:15:03 +03:00
Juliusz Stasiewicz
351f142791 cdc/check_and_repair_cdc_streams: ignore LEFT endpoints
When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla
would throw. This behavior is fixed so that LEFT nodes are simply ignored.

Fixes #9771

Closes #9778
2021-12-10 15:28:14 +01:00
Pavel Solodovnikov
777985b64d gms: gossiper: maybe_enable_features() should enable features in seastar::async context
Since `gms::feature::enable()` requires `seastar::async` context
to be present.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
5b5fbb4b33 gms: feature_service: expose registered features map
This will be used for re-enabling previously enabled cluster
features, which will be introduces in later patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
a2f5ad432f gms: feature_service: persist enabled features
Save each feature enabled through the feature_service
instance in the `system.scylla_local` under the
'enabled_features' key.

The features would be persisted only if the underlying
query context used by `db::system_keyspace` is initialized.

Since `system.scylla_local` table is essentially a
string->string map, use an ad-hoc method for serializing
enabled features set: the same as used in gossiper for
translating supported features set via gossip.

The entry should be saved before we enable the feature so
that crash-after-enable is safe.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Pavel Solodovnikov
e891f874df gms: move to_feature_set() function from gossiper to feature_service
This utility will also be used for de-serialization of
persisted enabled features, which will be introduced in a
later patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-11-28 14:18:11 +02:00
Benny Halevy
55967a8597 batchlog_manager: endpoint_filter: move to gossiper
There's nothing in this function that actually requries
the batchlog manager instance.

It uses a random number engine that's moved along with it
to class gossiper.

This resolves a circular dependency between the
batchlog_manager and storage_proxy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00
Pavel Emelyanov
9fccf7f3af gossiper: Guard background processing with gate
When shutdown gossiper may have some messages being processed in
the background. This brings two problems.

First, the gossiper itself is about to disappear soon and messages
might step on the freed instance (however, this one is not real now,
gossiper is not freed for real, just ::stop() is called).

Second, messages processing may notify other subsystems which, in
turn, do not expect this after gossiper is shutdown.

The common solution to this is to run background code through a gate
that gets closed at some point, the ::shutdown() in gossiper case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 10:25:03 +03:00