Commit Graph

25765 Commits

Author SHA1 Message Date
Piotr Sarna
e42dee6afb gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:37:13 +02:00
Piotr Sarna
28c9af6fa5 sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
The feature bit is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:37:04 +02:00
Piotr Sarna
08c4350968 gms: make TRUNCATION_TABLE feature unconditionally true
Turns out the feature was not used presently.
Historically, the commit which removed the support is
30a700c5b0 .
2021-03-30 09:36:45 +02:00
Piotr Sarna
c070178c7e gms: make ROW_LEVEL_REPAIR feature unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:36:11 +02:00
Piotr Sarna
80ebedd242 repair: stop relying on ROW_LEVEL_REPAIR feature
The feature is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:35:40 +02:00
Nadav Har'El
115324f71a Merge 'Add partial admission control to Thrift frontend' from Piotr Sarna
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.

Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.

Refs #4826

Closes #8313

* github.com:scylladb/scylla:
  thrift: add support for max_concurrent_requests_per_shard
  thrift: add metrics for admission control
  thrift: add a counter for in-flight requests
  thrift: add a counter for blocked requests
  thrift: partially add admission control
  service_permit: add a getter for the number of units held
  thrift: coroutinize processing a request
  memory_limiter: add a missing seastarx include
2021-03-29 21:36:50 +03:00
Raphael S. Carvalho
a390f4eb61 sstables: optimize LCS reshape for repair-based operations
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.

A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
	log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.

For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.

This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2

For repair-based decommission/removenode though, which reshape wasn't
wired on yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
2021-03-29 20:22:04 +03:00
Botond Dénes
3c54c990ab test: view_build_test: test_view_update_generator_buffering: fail gracefully
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
2021-03-29 17:58:28 +03:00
Avi Kivity
a8463cfb37 Merge "reader_permit: signal leaked resources" from Botond
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.

Tests: unit(release, debug)
"

* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
  reader_permit: signal leaked resources
  test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
  sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
2021-03-29 17:57:31 +03:00
Botond Dénes
9e01c4c667 test: view_build_test: test_view_update_generator_buffering: use separate permit for readers
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
2021-03-29 17:35:51 +03:00
Takuya ASADA
6f678ab7ff aws: initialize self._disks['ebs'] when no EBS disks
Seems like aws_instance.ebs_disks() causes traceback when no EBS disks
available, need to initialize with empty list.

Fixes #8365

Closes #8366
2021-03-29 17:21:14 +03:00
Gleb Natapov
13a3cf62bb raft: move incoming message processing into per state functions
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.

Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
2021-03-29 15:48:43 +02:00
Tomasz Grabiec
43fd322856 Merge 'scylla-gdb.py: Add io-queues command' from Piotr Sarna
The command can be used to inspect IO queues of a local reactor.
Example output:
```
    (gdb) scylla io-queues
        Dev 0:
            Class:                  |shares:         |ptr:
            --------------------------------------------------------------------------------
            "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
            "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
            "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
            "streaming"             |200             |(seastar::priority_class_data *)0x0
            "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
            "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

            Max request size:    2147483647
            Max capacity:        Ticket(weight: 4194303, size: 4194303)
            Capacity tail:       Ticket(weight: 73168384, size: 100561888)
            Capacity head:       Ticket(weight: 77360511, size: 104242143)

            Resources executing: Ticket(weight: 2176, size: 514048)
            Resources queued:    Ticket(weight: 384, size: 98304)
            Handles: (1)
                Class 0x6000005d7278:
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
            Pending in sink: (0)
```

Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.

Closes #8362

* github.com:scylladb/scylla:
  scylla-gdb: add io-queues command
  scylla-gdb.py: add parsing std::priority_queue
  scylla-gdb.py: add parsing std::atomic
  scylla-gdb.py: add parsing std::shared_ptr
  scylla-db.py: add parsing intrusive_slist
2021-03-29 15:31:48 +02:00
Piotr Sarna
adf07eb8fb scylla-gdb: add io-queues command
The command can be used to inspect reactor's IO queues. Example output:
(gdb) scylla io-queues
    Dev 0:
	Class:                  |shares:         |ptr:
	--------------------------------------------------------------------------------
        "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
        "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
        "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
        "streaming"             |200             |(seastar::priority_class_data *)0x0
        "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
        "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

        Max request size:    2147483647
        Max capacity:        Ticket(weight: 4194303, size: 4194303)
        Capacity tail:       Ticket(weight: 73168384, size: 100561888)
        Capacity head:       Ticket(weight: 77360511, size: 104242143)

        Resources executing: Ticket(weight: 2176, size: 514048)
        Resources queued:    Ticket(weight: 384, size: 98304)
        Handles: (1)
            Class 0x6000005d7278:
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
        Pending in sink: (0)
2021-03-29 15:01:25 +02:00
Piotr Sarna
f162423b8a scylla-gdb.py: add parsing std::priority_queue
The parsing assumes that the underlying storage is a vector,
which is often enough the case.
2021-03-29 13:10:36 +02:00
Piotr Sarna
e36c1f1d25 scylla-gdb.py: add parsing std::atomic 2021-03-29 13:10:36 +02:00
Piotr Sarna
0d4d04d3e6 scylla-gdb.py: add parsing std::shared_ptr 2021-03-29 13:10:36 +02:00
Piotr Sarna
c61822bc86 scylla-db.py: add parsing intrusive_slist 2021-03-29 13:10:36 +02:00
Piotr Sarna
bc1c92fd05 Merge 'Improve flat_mutation_reader::consume_pausable' from Piotr Jastrzębski
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.

This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns plain stop_iteration not a future<stop_iteration>
   it reduces the amount of future usage

Test: unit(dev, release, debug)

Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.

Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1316234   140.120ns     0.020ns   140.074ns   140.141ns
combined.single_active                          7332    91.484us    31.890ns    91.453us    91.778us
combined.many_overlapping                        945   870.973us   429.720ns   868.625us   871.403us
combined.disjoint_interleaved                   7102    85.989us     7.847ns    85.973us    85.997us
combined.disjoint_ranges                        7129    85.570us     7.840ns    85.562us    85.596us
combined.overlapping_partitions_disjoint_rows        5458   124.787us    56.738ns   124.731us   125.370us
clustering_combined.ranges_generic           1920688   217.940ns     0.184ns   217.742ns   218.275ns
clustering_combined.ranges_specialized       1935318   194.610ns     0.199ns   194.210ns   195.228ns
memtable.one_partition_one_row                624001     1.600us     1.405ns     1.599us     1.605us
memtable.one_partition_many_rows               79551    12.555us     1.829ns    12.549us    12.558us
memtable.many_partitions_one_row               40557    24.748us    77.083ns    24.644us    25.135us
memtable.many_partitions_many_rows              3220   310.429us    57.628ns   310.295us   311.189us
```

After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1358839   109.222ns     0.122ns   109.089ns   109.348ns
combined.single_active                          7525    87.305us    25.540ns    87.273us    87.362us
combined.many_overlapping                        962   853.195us     1.904us   851.244us   855.142us
combined.disjoint_interleaved                   7310    81.988us    28.877ns    81.949us    82.032us
combined.disjoint_ranges                        7315    81.699us    37.144ns    81.662us    81.874us
combined.overlapping_partitions_disjoint_rows        5591   120.964us    15.294ns   120.949us   121.120us
clustering_combined.ranges_generic           1954722   211.993ns     0.052ns   211.883ns   212.084ns
clustering_combined.ranges_specialized       2042194   187.807ns     0.066ns   187.732ns   188.289ns
memtable.one_partition_one_row                648701     1.542us     0.339ns     1.542us     1.543us
memtable.one_partition_many_rows               85007    11.759us     1.168ns    11.752us    11.782us
memtable.many_partitions_one_row               43893    22.805us    17.147ns    22.782us    22.843us
memtable.many_partitions_many_rows              3441   290.220us    41.720ns   290.172us   290.306us
```

Closes #8359

* github.com:scylladb/scylla:
  flat_mutation_reader: optimize consume_pausable for some consumers
  flat_mutation_reader: special case consumers in consume_pausable
  flat_mutation_reader: Change order of checks in consume_pausable
  flat_mutation_reader: fix indentation in consume_pausable
  flat_mutation_reader: Remove allocation in consume_pausable
  perf: Add benchmarks for large partitions
2021-03-29 13:06:56 +02:00
Piotr Sarna
4c79f132b6 thrift: add support for max_concurrent_requests_per_shard
The Thrift frontend is now capable of limiting the max number
of concurrent in-flight requests. Surplus requests are shed.

Tests: manual
2021-03-29 13:05:16 +02:00
Piotr Sarna
9f53327c9d thrift: add metrics for admission control
The new metrics include information about how many requests
were blocked on memory, how much is still available, etc.
2021-03-29 13:05:16 +02:00
Piotr Sarna
6b021779d2 thrift: add a counter for in-flight requests 2021-03-29 13:05:16 +02:00
Piotr Sarna
9391515461 thrift: add a counter for blocked requests
The counter tracks how many requests were blocked by the
memory estimation based admission control semaphore.
2021-03-29 13:05:16 +02:00
Piotr Sarna
ef1de114f0 thrift: partially add admission control
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
2021-03-29 13:05:16 +02:00
Nadav Har'El
ccc75bfe2a Merge 'Disable thrift by default' from Piotr Sarna
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.

Fixes #8336

Closes #8338

* github.com:scylladb/scylla:
  docs: mention disabling Thrift by default
  db,config: disable Thrift by default
2021-03-29 12:48:20 +03:00
Piotr Sarna
3388694e69 service_permit: add a getter for the number of units held
The helper function makes debugging considerably easier.
2021-03-29 11:34:18 +02:00
Piotr Sarna
364b921e25 thrift: coroutinize processing a request
While not particularly useful now, it will facilitate
later changes which introduce service permits.
2021-03-29 11:34:18 +02:00
Piotr Sarna
09621e5fc5 memory_limiter: add a missing seastarx include
It's that or declaring everything that belongs to seastar namespace
explicitly, and including "seastarx.hh" is more standard.
2021-03-29 11:34:18 +02:00
Michał Chojnowski
8c45225f21 docs: remove the obsolete IMR design note
IMR, as described in this design note, was removed in 001652815c.
This doc should have been removed back then, but was overlooked.

Closes #8340
2021-03-29 10:58:05 +02:00
Pekka Enberg
aec33c599b Update tools/python3 submodule
* tools/python3 6f3bcbe...ad04e8e (2):
  > dist/debian: fix renaming debian/scylla-* files rule
  > fix license of package build script to AGPL
2021-03-29 11:50:24 +03:00
Pekka Enberg
203b7394d7 Update tools/java submodule
* tools/java 7b66b7a0fc...ccc4201ded (1):
  > dist/debian: fix renaming debian/scylla-* files rule
2021-03-29 11:50:19 +03:00
Tomasz Grabiec
c0ce122f77 Merge "raft: wire up rpc add_server/remove_server for configuration changes" from Pavel Solodovnikov
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.

* manmanson/raft_mappings_wiring_v12:
  raft: update README.md with info on RPC server address mappings
  raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
  raft/fsm: add optional `rpc_configuration` field to fsm_output
  raft: maintain current rpc context in `server_impl`
  raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
  raft: unit-tests for `raft_address_map`
  raft: support expiring server address mappings for rpc module
2021-03-29 10:28:45 +02:00
Piotr Jastrzebski
86cf566692 flat_mutation_reader: optimize consume_pausable for some consumers
consumers that return stop_iteration not future<stop_iteration> don't
have to consume a single fragment per each iteration of repeat. They can
consume whole buffer in each iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
26cc4f112d flat_mutation_reader: special case consumers in consume_pausable
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and don't call futurize_invoke for them as this
is unnecessary work.

More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
164e23d2b1 flat_mutation_reader: Change order of checks in consume_pausable
This way we can avoid checking is_buffer_empty twice.
Compiler might be able to optimize this out but why depend on it
when the alternative is not less readable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
776ba29cec flat_mutation_reader: fix indentation in consume_pausable
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
9fb0014d72 flat_mutation_reader: Remove allocation in consume_pausable
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.

Indentation will be fixed in the following commit to make the review
easier.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
3aa7bee5e3 perf: Add benchmarks for large partitions
in perf_mutation_readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:48:11 +02:00
Avi Kivity
ec4d91f9eb tools: toolchain: dbuild: improve cgroupv2 detection code
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.

Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.

Closes #8355
2021-03-29 09:31:29 +03:00
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Tomasz Grabiec
6035fd05b3 Merge "Unify drain() and drain_on_shutdown()" from Pavel Emelyanov
The start-stop code is drifting towards a straightforward
scheme of a bunch of

  service foo;
  foo.start();
  auto stop_foo = defer([&foo] { foo.stop(); });

blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.

This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.

This unification also closes a bunch not-yet-caught bugs when
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_sutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.

tests: unit(dev), dtest(simple_boot_shutdown : dev),
       manual(start+stop, start+drain+stop : dev)
refs: #2737

* xemul/br-drain-on-shutdown:
  drain_on_shutdown: Simplify
  drain: Fix indentation
  storage_service: Unify drain and drain_on_shutdown
  storage_proxy: Drain and unsubscribe in main.cc
  migration_manager: Stop it in two phases
  stream_manager: Stop instances on drain
  batchlog_manager: Stop its instances on shutdown
  tracing: Shutdown tracing in drain
  tracing: Stop it in main.cc
  system_distributed_keyspace: Stop it in main.cc
  storage_service: Move (un)subscription to migration events
2021-03-26 18:37:27 +01:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
8799ccbab0 raft: use .contains instead of .count for std::set in raft::configuration::diff
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
7c229998e8 raft: unit-tests for raft_address_map
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
3c4d46728d raft: support expiring server address mappings for rpc module
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.

In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Emelyanov
d1796ab3dc drain_on_shutdown: Simplify
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, neither it needs logging before and after
as it's done by the deferred action from main that
calls it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
58b47efe16 drain: Fix indentation
Previous patch left it broken for readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
8d7ad6de03 storage_service: Unify drain and drain_on_shutdown
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.

Also the indentation is deliberately left broken for the
sake of patch readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00