27093 Commits

Author SHA1 Message Date
Avi Kivity
b7cb687d36 Merge "Harden storage_service::stop_transport" from Pavel E
"
Stopping transport (cql, thrift, messaging, etc.) can happen from
several places -- drain, decommission, stop, isolation. Some of
them can even run in parallel. This patch makes transport stopping
bulletproof.

tests: unit(dev)
       start-stop (dev)
       start-drain-stop (dev)
fixes: #8911
"

* 'br-stop-transport-races' of https://github.com/xemul/scylla:
  storage_service: Indentation fix
  storage_service: Make stop_transport re-entrable
  storage_service: Stop transport on decommission
2021-06-27 14:46:46 +03:00
Pavel Emelyanov
7014da9404 storage_service: Unregister disk error handlers on stop
Storage service installs disk error handlers in constructor and these
connections are not unregistered. It's not a problem in real life,
because storage service is not stopped, but in some tests this can
lead to use-after-frees.

The sstables_datafile_test runs some of the testcases in cql_test_env
which starts and (!) stops the storage service. Other testcases are
run in a lightweight sstables_test_env which does not mess with the
storage service at all. Now, if a case of the 2nd kind is run after
the one of the 1st and (for whatever reason) generates a disk error
it will trigger use-after-free -- after the 1st testcase the storage
service disk error registration would remain, but the storage service
itself would already be stopped, thus triggering the disk error will
try to access stopped sharded storage service inside the .isolate().

The fix is to keep the scoped connection on the storage service list
of various listeners. On stop it will go away automagically.
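
The scoped-connection idea can be sketched in standard C++ as follows. This is only an illustration under assumptions: `disk_error_signal`, `scoped_connection` and `service_sketch` are hypothetical names, not the types used in the real code. The point is that the registration is owned by the service, so destroying the service removes the handler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical minimal signal: handlers are stored by id and can be removed.
class disk_error_signal {
    std::map<int, std::function<void()>> _handlers;
    int _next_id = 0;
public:
    int connect(std::function<void()> handler) {
        _handlers.emplace(_next_id, std::move(handler));
        return _next_id++;
    }
    void disconnect(int id) {
        // Each id is disconnected exactly once by its scoped_connection.
        assert(_handlers.count(id) == 1);
        _handlers.erase(id);
    }
    void fire() const {
        for (const auto& entry : _handlers) {
            entry.second();
        }
    }
    std::size_t handler_count() const { return _handlers.size(); }
};

// RAII handle: the handler is disconnected when the handle is destroyed.
class scoped_connection {
    disk_error_signal* _sig = nullptr;
    int _id = 0;
public:
    scoped_connection(disk_error_signal& sig, std::function<void()> handler)
        : _sig(&sig), _id(sig.connect(std::move(handler))) {}
    scoped_connection(const scoped_connection&) = delete;
    scoped_connection& operator=(const scoped_connection&) = delete;
    scoped_connection(scoped_connection&& other) noexcept
        : _sig(std::exchange(other._sig, nullptr)), _id(other._id) {}
    ~scoped_connection() {
        if (_sig) {
            _sig->disconnect(_id);
        }
    }
};

// Hypothetical service that keeps its registrations on a list of listeners.
struct service_sketch {
    int isolations = 0;
    std::vector<scoped_connection> listeners;

    explicit service_sketch(disk_error_signal& sig) {
        // The lambda captures `this`, so the handler must not outlive the service.
        listeners.emplace_back(sig, [this] { ++isolations; });
    }
    // Not copyable or movable: the handler holds a pointer to this object.
    service_sketch(const service_sketch&) = delete;
    service_sketch& operator=(const service_sketch&) = delete;
};
```

Firing the signal after the service is destroyed then reaches no handler at all, which is the use-after-free the patch is avoiding.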

tests: unit(dev), sstables_datafile_test with forced disk error

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210625062648.27812-1-xemul@scylladb.com>
2021-06-27 14:41:55 +03:00
Avi Kivity
6676ceabde Merge 'Prevent reactor stall in utils::merge_to_gently' from Benny Halevy
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Also, eliminate extraneous loop on merge

first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Fixes #8897

Test: unit(dev), stall_free_test(debug)
DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev)

Closes #8925

* github.com:scylladb/scylla:
  utils: merge_to_gently: eliminate extraneous loop on merge
  utils: merge_to_gently: prevent stall in std::copy_if
2021-06-27 13:56:32 +03:00
Raphael S. Carvalho
29c93ae592 compaction: Reduce backlog of compacting SSTable properly
It was observed that as compaction progresses the backlog of compacting SSTable
is being reduced very slowly, which causes shares to be higher than needed, and
consequently compaction acts much more aggressively than it has to.

https://user-images.githubusercontent.com/1409139/120237819-93dfc080-c232-11eb-9042-68114e285ea0.png

The graph above shows the amount of backlog that is reduced from a SSTable
being compacted. The red line denotes the total backlog of the SSTable, before
it's selected for compaction. The expectation is that the more a SSTable is
compacted the more backlog will be reduced from it. However, in the current
implementation, it can be seen that the backlog to be reduced, from the SSTable
being compacted, starts being inversely proportional to the amount of data
already compacted.

Turns out that this problem happens because the implementation of backlog
formula becomes incorrect when the SSTable is being compacted.

Backlog for a sstable is currently defined as:
    Bi = Ei * log (T / Ei)

    where Ei = Si - Ci (bytes left to be compacted)
        and Si = size of SStable
        and Ci = total bytes compacted
        and T = total size of table

The formula above can also be rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Ei)

the second term `Ei * log (Ei)` can be rewritten as:
    = (Si - Ci) * log (Ei)
    = Si * log (Ei) - Ci * log (Ei)

However, digging into the backlog implementation, it turns out that we're
incorrectly implementing that second term as:
    = Si * log (Si) - Ci * log (Ei)

Given that Si > Ei, for a SSTable being compacted, the backlog will be higher
than it should.

the following table shows how the backlog of a SSTable being compacted behaves
now versus how it's supposed to behave:
https://gist.github.com/raphaelsc/42e14be0d7d4ed264e538c2d217c8f95

Turns out that this is not the only problem. It was a mistake to change the
formula from `Ei * log(T / Si)` to `Ei * log(T / Ei)`, when fixing the
shrinking table issue, because that also causes the backlog of a compacting
SSTable to be incorrectly reduced.

With the formula rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Ei)

It becomes clear that the more a SSTable is compacted, the slower it becomes
for backlog to be reduced, as T / Ei can increase considerably over time.

So we're reverting the formula back to `Ei * log(T / Si)`.

The graph below shows a better backlog behavior when table is shrinking:
https://user-images.githubusercontent.com/1409139/123495186-06a54700-d5f9-11eb-9386-3fcf4dd8e4d3.png

While analyzing the problem when table is shrinking, realized that it's because
T in the formula is implemented as the effective size (total + partial -
compacted).

With the new formula rewritten as follows:
    Bi = Ei * log (T) - Ei * log (Si)

It becomes clearer that T cannot be lower than Si whatsoever, otherwise the
backlog becomes negative. Also, while table is shrinking, it can happen that
the backlog will be so low that compaction will barely make any progress.
To fix both issues, let's implement T as total size (sum of all Si) rather than
effective size (sum of all Ei).
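
A small numeric sketch of the final formula, Bi = Ei * log(T / Si), may help here. It is an illustration only: `sstable_progress` and the helper names are hypothetical, and the natural logarithm is assumed because the message does not name a base. It shows that T taken as the sum of all Si keeps the backlog non-negative, while T taken as the sum of all Ei can push it below zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-SSTable progress record.
struct sstable_progress {
    double size;       // Si: size of the SSTable
    double compacted;  // Ci: bytes already compacted
};

// Bi = Ei * log(T / Si), with Ei = Si - Ci.
inline double sstable_backlog(const sstable_progress& s, double t) {
    assert(s.size > 0 && s.compacted >= 0 && s.compacted <= s.size);
    const double e = s.size - s.compacted;
    return e * std::log(t / s.size);
}

// T as total size: sum of all Si.
inline double total_size(const std::vector<sstable_progress>& all) {
    double sum = 0;
    for (const auto& s : all) {
        sum += s.size;
    }
    return sum;
}

// T as effective size: sum of all Ei.
inline double effective_size(const std::vector<sstable_progress>& all) {
    double sum = 0;
    for (const auto& s : all) {
        sum += s.size - s.compacted;
    }
    return sum;
}
```

For a table with one SSTable of size 100 that is 90% compacted and another of size 10 that is untouched, the total size is 110 and the effective size is 20. The first SSTable's backlog is positive with T = 110 and negative with T = 20, because 20 is lower than its Si.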

The graph below shows that this change prevents the backlog from going negative
while still providing similar and expected behavior as before, see:
https://user-images.githubusercontent.com/1409139/123495185-060cb080-d5f9-11eb-89f7-ed445729702a.png

Fixes #8768.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210626003133.3011007-1-raphaelsc@scylladb.com>
2021-06-27 11:43:48 +03:00
Pavel Emelyanov
a89ae9a8e7 storage_service: Indentation fix
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:21:10 +03:00
Pavel Emelyanov
bd2a58060e storage_service: Make stop_transport re-entrable
It may happen that disk error occurs and subsequent isolation runs
in parallel with drain or decommission or shutdown. In this case the
stop_transport method would be running two times in parallel. Also
the drain after decommission is not disabled, so it may happen that
stop_transport will be called two times in a row (however -- not in
parallel).

Using shared_promise solves all the possible reentrances.

(the indentation is deliberately left broken)
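
The re-entrance pattern can be sketched without Seastar as a single-threaded analog of a shared promise. This is an assumption-laden illustration, not the patch itself: `transport_stopper`, `complete_stop` and the callback style are hypothetical. Whoever asks first starts the stop; everyone else, whether they arrive while it is in flight or after it finished, only waits for the same completion.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded analog of "stop once, let everyone wait".
class transport_stopper {
    enum class state { running, stopping, stopped };
    state _state = state::running;
    std::vector<std::function<void()>> _waiters;
    int _stop_runs = 0;
public:
    // Request a stop. on_stopped runs once the single stop has finished.
    void stop_transport(std::function<void()> on_stopped) {
        if (_state == state::stopped) {
            on_stopped();  // late caller: already done, nothing to repeat
            return;
        }
        _waiters.push_back(std::move(on_stopped));
        if (_state == state::running) {
            _state = state::stopping;
            ++_stop_runs;  // stand-in for kicking off the real stop sequence
        }
    }

    // Called when the (asynchronous) stop sequence completes.
    void complete_stop() {
        assert(_state == state::stopping);
        _state = state::stopped;
        auto waiters = std::move(_waiters);
        _waiters.clear();
        for (auto& w : waiters) {
            w();
        }
    }

    int stop_runs() const { return _stop_runs; }
};
```

Two callers that overlap, plus one that calls again afterwards, all get notified while the stop sequence itself runs once.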

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:18:43 +03:00
Pavel Emelyanov
b0199b005d storage_service: Stop transport on decommission
The stop_transport sequence is:
- stop client services (cql, thrift, alternator)
- stop gossiping
- stop messaging
- stop stream manager

The decommissioning goes very similarly
- stop client services
- stop batchlog manager
- stop gossiping
- stop messaging

So this change makes decommission stop all networking _before_
batchlog, like it's already done on drain, and additionally stop
the streaming manager.

This change is prerequisite for fixing race between transport
stop and isolation (on disk error).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-25 13:15:38 +03:00
Pavel Emelyanov
3552e99ce7 scylla-gdb: Bring scylla netw back to work
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.

The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
2021-06-24 20:59:27 +03:00
Nadav Har'El
4d7f55a29f cql: add configurable restriction of DateTieredCompactionStrategy
DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time
(users should use TimeWindowCompactionStrategy, TWCS, instead). This
patch adds a new configuration option - restrict_dtcs - which can be used
to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE
statements. This is part of a "safe mode" effort to allow an installation
to restrict operations which are un-recommended or dangerous.

The new restrict_dtcs option has three values: "true", "false", and "warn":

For the time being, "false" is still the default, and means DTCS is not
restricted and can still be used freely. We can easily change this
default in a followup patch.

Setting a value of "true" means that DTCS *is* restricted -
trying to create a table or alter a table with it will fail with an error.

Setting a value of "warn" will allow the create or alter operation, but
will warn the user - both with a warning message which will immediately
appear in cqlsh (for example), and with a log message.
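
The three-valued behaviour can be sketched as below. Only the option values ("true", "false", "warn") and the strategy name come from this message; the enum, the function names and the error and warning texts are hypothetical and do not reflect the actual configuration code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical in-memory form of the option: "false", "warn", "true".
enum class restriction { off, warn, on };

inline restriction parse_restriction(const std::string& value) {
    if (value == "true") {
        return restriction::on;
    }
    if (value == "false") {
        return restriction::off;
    }
    if (value == "warn") {
        return restriction::warn;
    }
    throw std::invalid_argument("expected true, false or warn, got: " + value);
}

// Returns a warning to show to the user (empty if none).
// Throws if the statement must be rejected.
inline std::string check_compaction_strategy(const std::string& strategy, restriction r) {
    if (strategy != "DateTieredCompactionStrategy" || r == restriction::off) {
        return "";
    }
    if (r == restriction::on) {
        throw std::invalid_argument(
            "DateTieredCompactionStrategy is restricted; use TimeWindowCompactionStrategy");
    }
    assert(r == restriction::warn);
    return "DateTieredCompactionStrategy is not recommended; use TimeWindowCompactionStrategy";
}
```

In this sketch, a CREATE TABLE or ALTER TABLE path would call the check once, log and return the warning if the string is non-empty, and let the exception fail the statement.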

Fixes #8914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210624122411.435361-1-nyh@scylladb.com>
2021-06-24 20:59:27 +03:00
Benny Halevy
b96eeaefe4 utils: merge_to_gently: eliminate extraneous loop on merge
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Note that the standard states that no iterators or references are invalidated
on insert so we can safely keep looking at `first1` after inserting a copy of
`*first2` before it.
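
A minimal sketch of such a merge over std::list, assuming standard C++ only. `merge_sorted_gently` and the `yield` callback are hypothetical; the callback stands in for whatever preemption point the real helper uses. After inserting before `first1`, the iterator is kept as is, so no extra iteration is spent stepping over the value that was just inserted.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <list>

// Merge sorted list2 into sorted list1, calling yield() once per step.
template <typename T>
void merge_sorted_gently(std::list<T>& list1, const std::list<T>& list2,
                         const std::function<void()>& yield) {
    assert(std::is_sorted(list1.begin(), list1.end()));
    assert(std::is_sorted(list2.begin(), list2.end()));
    auto first1 = list1.begin();
    auto first2 = list2.begin();
    while (first2 != list2.end()) {
        yield();
        if (first1 == list1.end()) {
            // Tail of list2: append one element per step instead of one
            // long uninterrupted copy.
            list1.insert(list1.end(), *first2);
            ++first2;
        } else if (*first2 < *first1) {
            // std::list::insert invalidates no iterators, so first1 still
            // points at the same element and needs no re-positioning.
            list1.insert(first1, *first2);
            ++first2;
        } else {
            ++first1;
        }
    }
}
```

Merging {2, 3, 8, 9} into {1, 4, 7} this way takes seven steps, each preceded by a call to `yield`.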

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:58:12 +03:00
Benny Halevy
453e7c8795 utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:47:25 +03:00
Benny Halevy
9bbe7b1482 sstables: mx_sstable_mutation_reader: enforce timeout
Check if the timeout has expired before issuing I/O.

Note that the sstable reader input_stream is not closed
when the timeout is detected. The reader must be closed anyhow after
the error bubbles up the chain of readers and before the
reader is destroyed.  This might already happen if the reader
times out while waiting for reader_concurrency_semaphore admission.
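
The check-before-I/O idea, sketched with std::chrono under stated assumptions: `read_with_deadline`, `timed_out_error` and the explicit `now` parameter are hypothetical and only illustrate the ordering (test the deadline first, then issue the read). As the note above says, cleanup is still the caller's job after the error propagates.

```cpp
#include <chrono>
#include <stdexcept>

using clock_type = std::chrono::steady_clock;

// Hypothetical error type for an expired deadline.
struct timed_out_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs issue_read() only if the deadline has not passed yet.
template <typename ReadFn>
auto read_with_deadline(clock_type::time_point deadline,
                        clock_type::time_point now,
                        ReadFn&& issue_read) -> decltype(issue_read()) {
    if (now >= deadline) {
        // No I/O is started; closing the reader is left to the caller.
        throw timed_out_error("timeout expired before issuing I/O");
    }
    return issue_read();
}
```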

Test: unit(dev), auth_test.test_alter_with_timeouts(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>
2021-06-24 12:26:57 +02:00
Kamil Braun
a3f3563828 storage_service: check for existing normal token owners before bootstrapping
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.

However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assumption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).

Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.

Also add an assert in CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
was false; better to crash on assert than this).

Ref #8889.

Closes #8896
2021-06-24 13:19:08 +03:00
Asias He
2ad8fb756e gossip: Promote gossip quarantine over log to info level
1) Start node n1, n2, n3

2) Bootstrap n4 and kill n4 in the middle of bootstrap

3) Wipe data on n4 and start n4 again

After step 2, n1, n2 and n3 will remove n4 from gossip after
fat_client_timeout and put n4 in quarantine for quarantine_delay().

If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2
and n3 will ignore gossip updates from n4, and n4 will not learn gossip
updates from the cluster.

After PR #8896, the bootstrap will be rejected.

This patch promotes the gossip quarantine over log to info level, so
that dtest can wait for the log to bootstrap the node again.

Refs #8889
Refs #8890

Closes #8905
2021-06-24 12:51:32 +03:00
Michael Livshin
9b9efb2b42 disable caching of the system.large_* tables
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (https://github.com/scylladb/scylla/issues/7750), and those
range tombstones can be evicted only together with their partition
(https://github.com/scylladb/scylla/issues/3288).

Making the system.large_* tables uncached should work around the
problem until #3288 is fixed.

Fixes #8874
Refs #7750
Refs #3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
2021-06-24 12:26:45 +03:00
Piotr Sarna
ae9e52a774 Merge 'Cleanup and improvements for docs/alternator/alternator.md' from Nadav Har'El
Make some improvements to docs/alternator.md as suggested by a user who
had trouble understanding the previous version, and also a few other
random cleanups.

Closes #8910

* github.com:scylladb/scylla:
  docs/alternator/alternator.md: improve "Running Alternator" section
  docs/alternator/alternator.md: correct minor typos
  docs/alternator/alternator.md: fix link format
2021-06-24 12:03:26 +03:00
Avi Kivity
14252c8b71 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
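
Point 1 above can be illustrated with a small sketch. The struct and function names are hypothetical and the numbers are made up; the only idea taken from the list is that the threshold check should count wasted (slack) space as well as used space.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-segment accounting.
struct segment_usage {
    std::uint64_t used_bytes;    // bytes that hold mutations
    std::uint64_t wasted_bytes;  // slack that no mutation fit into
};

// Bytes actually holding data.
inline std::uint64_t used_only(const std::vector<segment_usage>& segments) {
    std::uint64_t sum = 0;
    for (const auto& s : segments) {
        sum += s.used_bytes;
    }
    return sum;
}

// Footprint as the disk sees it: used plus wasted.
inline std::uint64_t footprint(const std::vector<segment_usage>& segments) {
    std::uint64_t sum = 0;
    for (const auto& s : segments) {
        sum += s.used_bytes + s.wasted_bytes;
    }
    return sum;
}

// Flush decision based on the footprint rather than on used bytes alone.
inline bool should_flush(const std::vector<segment_usage>& segments,
                         std::uint64_t threshold) {
    return footprint(segments) >= threshold;
}
```

With segments of (60 used, 40 wasted) and (50 used, 50 wasted) and a threshold of 150, used bytes alone (110) stay under the threshold while the footprint (200) is over it, which is the mismatch described above.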

Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
a61afe8421 btree: Improve unlink_leftmost_without_rebalance()
The helper is used to walk the tree key-by-key destroying it
in the meantime. Current implementation of this method just
uses the "regular" erasing code which actually rebalances the
tree despite the name.

The biggest problem with removing the rebalancing is that at
some point non-balanced tree may have the left-most key on an
inner node, so to make 100% rebalance-less unlink every other
method of the tree would have to be prepared for that. However,
there's an option to make "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key is on the leaf.

Some more tech details. Current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough

The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.

This change removes ~60% of the keys movements on random test.
Keys still move when the sibling replace happens because in
this case the separation key needs to be placed at the right
sibling 0 position which means shifting all its keys right.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
ab9d08d80e sstables: Remove unused filtering reader from sstable_set::make_local_shard_sstable_reader()
Tables stopped accepting shared sstables a long time ago, so this
code, which creates a filtering reader if the sstable is shared, is never used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-2-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
88119a5c81 distributed_loader: Kill table's _sstables_opened_but_not_loaded
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape finished, so
there's no need whatsoever for that mess.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Tomasz Grabiec
ee28eb4100 Merge "test: raft: move some tests to raft folder" from Pavel Solodovnikov
Move `raft_sys_table_storage_test` and `raft_address_map_test` to
`test/raft` folder since they naturally belong here, not in
`test/boost` folder.

Tests: unit(dev)

* manmanson/move_some_raft_tests_to_raft_folder:
  test: raft: move `raft_address_map_test` to `raft` folder
  test: raft: move `raft_sys_table_storage_test` to `raft` folder
  configure: add extended raft testing dependencies
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
e031e7b0a7 scylla-gdb: Do not leave random offset in smp-queues known vptrs
The process of getting a queue pointer is quite tricky here.
First, it checks if the vptr resolves into 'vtable for async_work_item'
and puts a None mark into known_vptrs dict.
Then, if the entry is present there are two options. First, if it's NOT
None, it's translated directly into the queue object. But if it's None,
then a loop over an offset starts that tries to check if the vptr + offset
maps to a queue. And here's the problem -- if no offsets were mapped to
any specific queues the last checked offset is put into the known vptrs
dict, so the next vptrs will miss the 2nd offset checking, but will go
ahead and use the "random" offset that had failed previously.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624085723.7156-1-xemul@scylladb.com>
2021-06-24 12:03:22 +03:00
Nadav Har'El
b965bc76e0 docs/alternator/alternator.md: improve "Running Alternator" section
A user complained that the "Running Alternator" section was confusing.
It didn't say outright which two configurations are necessary and you
had to read a few paragraphs to reach it, and it mixed the YAML names
of options and the command-line names, which are subtly different.

This patch tries to improve this.
2021-06-23 19:41:52 +03:00
Tomasz Grabiec
a60e73fe14 Merge "raft: allow to initiate leader stepdown process explicitly" from Gleb
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.

* scylla-dev/raft-stepdown-v4:
  raft: test: test leadership transfer timeout
  raft: allow to initiate leader stepdown process
2021-06-23 00:14:46 +02:00
Pavel Solodovnikov
a96ddbec35 test: raft: move raft_address_map_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:33:22 +03:00
Pavel Solodovnikov
cf5025c44e test: raft: move raft_sys_table_storage_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:31:41 +03:00
Pavel Solodovnikov
6912f76e45 configure: add extended raft testing dependencies
Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies`
and introduce `scylla_raft_dependencies` that contains
`scylla_core` (i.e., all scylla source files).

The new `scylla_raft_dependencies` variable will be used
for `raft_address_map_test` and `raft_sys_table_storage_test`,
which use a lot of machinery from scylla.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:26:18 +03:00
Nadav Har'El
3895d4bb99 docs/alternator/alternator.md: correct minor typos
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-22 20:03:48 +03:00
Benny Halevy
4ab4f63efe sstables: mx/writer: flush_tmp_bufs: maybe_yield in loop
This loop may cause pretty long reactor stalls as seen in
https://github.com/scylladb/scylla/issues/8900

Apparently output_stream<CharType>::slow_write returns
a ready future and no yielding is considered, so
add a check in the top level loop (that must already
be called from a seastar thread).

Fixes #8900

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210622152206.156302-1-bhalevy@scylladb.com>
2021-06-22 18:56:12 +03:00
Avi Kivity
d27e88e785 Merge "compaction: prevent broken_promise or dangling reader errors" from Benny
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.

At the moment compaction just closes the reader unilaterally
and this yanks the reader from under the queue_reader_handle feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.

Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.

Fixes #8755

Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"

* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
  queue_reader_handle: add get_exception method
  queue_reader: close: set value on promises on detach from handle
2021-06-22 18:52:11 +03:00
Nadav Har'El
5bb4966cac docs/alternator/alternator.md: fix link format
Unfortunately the scylla.docs.scylladb.com formatter which generates
https://scylla.docs.scylladb.com/master/alternator/alternator.html
doesn't know how to recognize HTTP URLs and convert them into proper
HTML links (something which github's formatter does).

So convert the two URLs we had in alternator.md into markdown links
which both github and our formatter recognize.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-22 18:43:27 +03:00
Calle Wilund
373fa3fa07 table: ensure memtable is actually in memtable list before erasing
Fixes #8749

if a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase

v2:
* Make interface more set-like (tolerate non-existence in erase).
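
A sketch of a set-like erase over a vector, assuming hypothetical names (`memtable_list_sketch` and an empty `memtable` placeholder). The lookup result is checked before erasing, since calling std::vector::erase with the end iterator is undefined behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct memtable {};  // hypothetical placeholder type

class memtable_list_sketch {
    std::vector<std::shared_ptr<memtable>> _memtables;
public:
    void add(std::shared_ptr<memtable> m) {
        assert(m);  // null entries are not expected in this sketch
        _memtables.push_back(std::move(m));
    }
    void clear() { _memtables.clear(); }
    std::size_t size() const { return _memtables.size(); }

    // Set-like erase: returns false, and does nothing, if m is not present.
    bool erase(const std::shared_ptr<memtable>& m) {
        auto it = std::find(_memtables.begin(), _memtables.end(), m);
        if (it == _memtables.end()) {
            return false;  // already gone, e.g. the list was cleared meanwhile
        }
        _memtables.erase(it);
        return true;
    }
};
```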

Closes #8904
2021-06-22 15:58:56 +02:00
Asias He
ffa211a8c7 repair: Avoid copy rows in apply_rows_on_master_in_thread
The rows are not used after the call to to_repair_rows_list. Use
std::move() to avoid copying.
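
The effect can be shown with a copy-counting type. Everything here is hypothetical (`counted_row`, `to_row_list`) and only mirrors the shape of the change: a function that takes its input by value copies the rows when given an lvalue and copies nothing when the caller passes std::move().

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type that counts how often it is copied.
struct counted_row {
    std::string data;
    static int copies;

    explicit counted_row(std::string d) : data(std::move(d)) {}
    counted_row(const counted_row& other) : data(other.data) { ++copies; }
    counted_row(counted_row&& other) noexcept = default;
};
int counted_row::copies = 0;

// Takes the rows by value, so the caller decides between copy and move.
inline std::list<counted_row> to_row_list(std::vector<counted_row> rows) {
    std::list<counted_row> out;
    for (auto& r : rows) {
        out.push_back(std::move(r));  // rows is a local, so moving is safe
    }
    return out;
}
```

Calling `to_row_list(rows)` copies every row once; calling `to_row_list(std::move(rows))` copies none, which is only correct when, as here, the rows are not used afterwards.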

Fixes #8902

Closes #8903
2021-06-22 15:58:56 +02:00
Benny Halevy
02917c79b6 logalloc: get rid of unused _descendant_blocked_requests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>
2021-06-22 15:58:56 +02:00
Piotr Dulikowski
de1679b1b9 hints: make hints concurrency configurable and reduce the default
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
in runtime. Additionally, the default concurrency per shard is made
lower and is now 8.

The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)

Refs: #8624

Closes #8646
2021-06-22 15:58:56 +02:00
Gleb Natapov
09528b8671 raft: test: test leadership transfer timeout
Test that if leadership transfer cannot be done in configured time frame
fsm cancels the leadership transfer process. Also check that timeout_now
message is resent on each tick while leadership transfer is in progress.
2021-06-22 14:42:50 +03:00
Gleb Natapov
ed49d29473 raft: allow to initiate leader stepdown process
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.

We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If a node is still a leader after the timeout the
operation will fail.
2021-06-22 14:36:42 +03:00
Konstantin Osipov
bd410da77a raft: (service) rename raft_services service to raft_group_registry
This is a more informative name. It helps show that, say, group0
is a separate service rather than bundling all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Konstantin Osipov
025f18325e raft: (service) move raft service to namespace service
Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Calle Wilund
fdb5801704 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

v3:
* Fix table::clear to discard rp_sets before memtables

Closes #8894
2021-06-21 14:53:54 +03:00
Takuya ASADA
a677c46672 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.

Also, we mixed two types of files as ghost files, which should be handled differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

See scylladb/scylla-enterprise#1780

Closes #8810
2021-06-21 14:53:54 +03:00
Calle Wilund
0a7823e683 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate case where we mismatch segments usage with actual
disk footprint and fail to flush enough to allow segment recycling
2021-06-21 06:01:19 +00:00
Calle Wilund
954da1f0a9 commitlog_test: Improve test assertion
Changes it so actual data is printed, not just error.
2021-06-21 06:01:19 +00:00
Calle Wilund
d6113912cd commitlog: Add waitable future for background sync/flush
Commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk, even if actually
clean.

This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for this
to complete in shutdown, before pruning and deleting segments
2021-06-21 06:01:19 +00:00
Benny Halevy
499357fb43 row_cache: autoupdating_underlying_reader: fast_forward_to: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-2-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Benny Halevy
3db7db5743 row_cache: autoupdating_underlying_reader: fast_forward_to: capture snapshot by value when updating reader
Currently we capture the snapshot mutation_source by reference
for calling create_underlying_reader after closing the reader.
However, if close_reader yields, the snapshot reference passed
may be gone, so capture it by value instead.
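
The lifetime issue can be sketched in plain C++ with a deferred callback. `snapshot_source` and `make_reopen_step` are hypothetical stand-ins; the deferred step plays the role of the code that runs after the close completes. Capturing by value gives that step its own copy, so it stays valid even if the caller's object is gone by then.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the snapshot being captured.
struct snapshot_source {
    std::string name;
};

// Builds the step that runs later, once the old reader has been closed.
inline std::function<std::string()> make_reopen_step(const snapshot_source& snap) {
    // [snap] copies the object into the closure. With [&snap] the closure
    // would hold a reference that dangles once the caller's object is gone.
    return [snap] { return "reader over " + snap.name; };
}
```

The step can then be invoked after the original `snapshot_source` has been destroyed without touching freed memory.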

Fixes #8848

Test: unit(dev)
DTest: restore_snapshot_using_old_token_ownership_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Avi Kivity
5b3fb83ebe Merge "Remove unused code here and there" from Pavel E
"
A few dead code locations spotted at random over time.
Compile-test only.
"

* 'br-remove-unused-stuff' of https://github.com/xemul/scylla:
  database: Remove unused forward declarations
  feature: Remove unused friendship with gossiper
  schema_tables: Remove unused sharded<proxy> argument
  database: Remove few unused sharded<proxy> captures
  view_update_generator: Remove unused struct sstable_with_table
  storage_service: Remove write-only _force_remove_completion
  distributed_loader: Remove unused load-prio manipulations
2021-06-20 12:01:40 +03:00
Pavel Emelyanov
ab4fc41f25 database: Remove unused forward declarations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
d606321575 feature: Remove unused friendship with gossiper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
96131349e8 schema_tables: Remove unused sharded<proxy> argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00