Fixes #8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because a segment has empty space but cannot hold the mutation being added), we can have disk usage that is below the threshold, yet still get a disk footprint that is over the limit, causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check (sketched below). Whether or not
the disk space is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting
for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably start
flushing once we've used enough space to be in the last available segment, so a new
one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with
background disk syncs, we may need to issue segment prunes from end_flush() so we
wake up actual file deletion/recycling.
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled the background
task etc., otherwise we might fail to wake up the replenisher and get stuck in the gate.
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For
the same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with a recognized marker) to
handle the edge case where we only have a single (allocating) segment in the list, and
cannot wake up the replenisher in any more civilized way.
Also fix an edge case (for tests) when we have too few segments to have an active one
(i.e. we need to flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes #8764
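A minimal sketch of the accounting idea in points 1.) and 3.), using hypothetical names rather than the real segment_manager members:
```
#include <cstdint>

// Count space wasted by closed-but-not-full segments as part of the footprint,
// and start flushing one segment "early" so a fresh segment is (hopefully)
// ready before allocation has to stall.
struct footprint_tracker {
    uint64_t bytes_written = 0;   // payload actually written to segments
    uint64_t bytes_slack = 0;     // wasted tails of closed segments
    uint64_t segment_size = 0;
    uint64_t max_disk_size = 0;

    uint64_t footprint() const {
        return bytes_written + bytes_slack;
    }
    // point 3.): flush once we're inside the last segment that fits the limit
    bool should_flush() const {
        return footprint() + segment_size >= max_disk_size;
    }
    // point 1.): stall strictly on footprint, not on "real" usage
    bool must_stall() const {
        return footprint() >= max_disk_size;
    }
};
```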
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
The helper is used to walk the tree key-by-key, destroying it
in the meantime. The current implementation of this method just
uses the "regular" erasing code, which actually rebalances the
tree despite the name.
The biggest problem with removing the rebalancing is that at
some point a non-balanced tree may have its left-most key on an
inner node, so to make unlink 100% rebalance-less, every other
method of the tree would have to be prepared for that. However,
there's an option to do a "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key is on a leaf.
Some more tech details. The current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough
The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. Second, once the node
is empty it goes ahead and replaces it with the next sibling.
This change removes ~60% of the key movements on a random test.
Keys still move when the sibling replacement happens, because in
this case the separation key needs to be placed at position 0 of
the right sibling, which means shifting all of its keys right.
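A toy sketch of the rule with simplified structures (not the actual intrusive B-tree code):
```
#include <vector>

// Toy model: a node is tolerated in an under-populated state; only when it
// becomes empty is it merged away, and the separation key from the parent is
// pushed into slot 0 of the right sibling (which is why some key movement
// remains).
struct toy_node {
    std::vector<int> keys;
};

// Returns true if `node` should be unlinked by the caller.
bool light_rebalance(toy_node& node, int separation_key, toy_node& right_sibling) {
    if (!node.keys.empty()) {
        return false;                       // "light": leave it unbalanced
    }
    right_sibling.keys.insert(right_sibling.keys.begin(), separation_key);
    return true;                            // caller drops `node` and fixes parent links
}
```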
tests: unit(debug)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape have finished, so
there's no need whatsoever for that mess.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
Move `raft_sys_table_storage_test` and `raft_address_map_test` to the
`test/raft` folder since they naturally belong there, not in the
`test/boost` folder.
Tests: unit(dev)
* manmanson/move_some_raft_tests_to_raft_folder:
test: raft: move `raft_address_map_test` to `raft` folder
test: raft: move `raft_sys_table_storage_test` to `raft` folder
configure: add extended raft testing dependencies
The process of getting a queue pointer is quite tricky here.
First, it checks if the vptr resolves into 'vtable for async_work_item'
and puts a None mark into the known_vptrs dict.
Then, if the entry is present there are two options. First, if it's NOT
None, it's translated directly into the queue object. But if it's None,
then a loop over an offset starts that tries to check whether vptr + offset
maps to a queue. And here's the problem -- if no offset was mapped to
any specific queue, the last checked offset is put into the known_vptrs
dict, so the next vptrs will skip the 2nd step of offset checking, but will go
ahead and use the "random" offset that had failed previously.
tests: unit(gdb)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624085723.7156-1-xemul@scylladb.com>
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.
* scylla-dev/raft-stepdown-v4:
raft: test: test leadership transfer timeout
raft: allow to initiate leader stepdown process
Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies`
and introduce `scylla_raft_dependencies` that contains
`scylla_core` (i.e., all scylla source files).
The new `scylla_raft_dependencies` variable will be used
for `raft_address_map_test` and `raft_sys_table_storage_test`,
which use a lot of machinery from scylla.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.
At the moment compaction just closes the reader unilaterally
and this yanks the reader out from under the queue_reader_handle's feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.
Instead, fix queue_reader::close to set a value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if the handle is terminated.
Fixes #8755
Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"
* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
queue_reader_handle: add get_exception method
queue_reader: close: set value on promises on detach from handle
Fixes #8749
If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this before
erasing; otherwise we get random memory corruption via
std::vector::erase.
v2:
* Make the interface more set-like (tolerate non-existence in erase).
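An illustrative, generic sketch of the set-like erase semantics (not the actual memtable list code):
```
#include <algorithm>
#include <vector>

// Erasing an entry that is no longer present becomes a no-op instead of
// calling std::vector::erase on a bogus iterator.
template <typename T>
bool erase_if_present(std::vector<T>& v, const T& value) {
    auto it = std::find(v.begin(), v.end(), value);
    if (it == v.end()) {
        return false;   // already removed, e.g. by a concurrent table::clear()
    }
    v.erase(it);
    return true;
}
```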
Closes #8904
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
at runtime. Additionally, the default concurrency per shard is
lowered and is now 8.
The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
Refs: #8624
Closes #8646
Test that if leadership transfer cannot be done within the configured time frame,
the fsm cancels the leadership transfer process. Also check that the timeout_now
message is resent on each tick while leadership transfer is in progress.
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.
We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If the node is still the leader after the timeout, the
operation fails.
This is a more informative name. It helps to see that, say, group0
is a separate service, rather than bundling all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
Fixes #8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count discard (using the rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing an explicit per-memtable discard call. But to
ensure this works, since a memtable being flushed remains on the
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
v3:
* Fix table::clear to discard rp_sets before memtables
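A hypothetical sketch of the v3 ordering, with stand-in types instead of the real commitlog/memtable classes:
```
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in types for illustration only.
struct rp_set { std::vector<uint64_t> positions; };
struct memtable_sketch { rp_set rps; };
struct commitlog_sketch {
    void discard_completed_segments(const rp_set&) { /* decrement per-segment counts */ }
};

// table::clear() idea: discard each memtable's rp_set explicitly and clear it,
// so a flush still in flight cannot discard the same segments a second time
// and corrupt the internal usage counts.
void clear_sketch(commitlog_sketch& log, std::vector<memtable_sketch>& active_memtables) {
    for (auto& mt : active_memtables) {
        log.discard_completed_segments(std::exchange(mt.rps, rp_set{}));
    }
    active_memtables.clear();
}
```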
Closes #8894
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory after
uninstalling Scylla, or may want to upgrade to the enterprise version.
Also, we mixed two types of files as ghost files, which should be handled differently:
1. files automatically generated by the postinst scriptlet
2. files generated by user-invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
See scylladb/scylla-enterprise#1780
Closes #8810
The commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long, we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk even if it is actually
clean.
This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for this
to complete in shutdown, before pruning and deleting segments.
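A sketch of the chaining, with a hypothetical member name and error handling elided:
```
#include <seastar/core/future.hh>
#include <utility>
using namespace seastar;

struct segment_manager_sketch {
    // All un-waited background syncs/flushes get folded into this one future.
    future<> _pending_background_io = make_ready_future<>();

    void track_background_io(future<> f) {
        _pending_background_io = _pending_background_io.then_wrapped(
            [f = std::move(f)] (future<> prev) mutable {
                prev.ignore_ready_future();   // don't let one failure hide later work
                return std::move(f);
            });
    }

    // shutdown() waits on this before pruning and deleting segments.
    future<> wait_for_pending_io() {
        return std::exchange(_pending_background_io, make_ready_future<>());
    }
};
```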
Currently we capture the snapshot mutation_source by reference
for calling create_underlying_reader after closing the reader.
However, if close_reader yields, the snapshot reference passed
may be gone, so capture it by value instead.
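A generic illustration of the bug class being fixed (not the actual multishard reader code):
```
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <string>
using namespace seastar;
using namespace std::chrono_literals;

// If this continuation captured `src` by reference and the caller's object
// went out of scope while sleep() yielded, the reference would dangle.
// Capturing by value keeps the object alive until the continuation runs.
future<size_t> use_after_yield(std::string src) {
    return sleep(1ms).then([src /* by value, not [&src] */] {
        return make_ready_future<size_t>(src.size());
    });
}
```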
Fixes #8848
Test: unit(dev)
DTest: restore_snapshot_using_old_token_ownership_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>
* tools/java 599b2368d6...5013321823 (4):
> cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
> node_probe: toppartitions: Fix wrong class in getMethod
> Fix NullPointerException in SettingsMode
> cassandra-stress: Remove maxPendingPerConnection default
* tools/jmx a7c4c39...5311e9b (2):
> storage_service: takeSnapshot: support the skipFlush option
> build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
"
The storage service is carried along with the storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers to lifecycle events (and stop
the subscription on shutdown) emitted from the storage service.
This dependency chain can be greatly simplified, since the
storage proxy is already subscribed to lifecycle events and
can kick the managers directly from its hooks.
tests: unit(dev),
dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"
* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
hints: Drop storage service from managers
hints: Do not subscribe managers on lifecycle events directly
The existing test_ssl.py, which tests Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.
So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.
With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - from SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.
Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.
Refs #8837.
Refs #8827.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
* seastar 813eee3e...0e48ba88 (5):
> net/tls: on TLS handshake failure, send error to client
> net/dns: fix build on gcc 11
> core: fix docstring for max_concurrent_for_each
> test: alien_test: replace deprecated call to alien::submit_to() with new variant
> alien: prepare for multi-instance use
The fix "net/tls: on TLS handshake failure, send error to client"
fixes #8827.
The test
test/cql-pytest/run --ssl test_ssl.py
now xpasses, so I'll remove the "xfail" mark in a followup patch.
The storage service pointer is only used to (un)subscribe
to (from) lifecycle events. Now that the subscription is gone,
the storage service pointer can go too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Managers sit on the storage proxy, which is already subscribed to
lifecycle events, so it can "notify" the hints managers directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The patch set introduces a few leadership transfer tests, some of which
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).
Others test different scenarios ensuring that a pending leadership
transfer doesn't prevent the rest of the cluster from progressing:
lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as a
lost `vote_request(force)` from the new candidate
(`test_leader_transfer_lost_force_vote_request`) don't impact the
subsequent election process, and a leader is elected as normal.
* manmanson/leadership_transfer_tests_v3:
raft: etcd_test: test_transfer_non_member
raft: etcd_test: test_leader_transfer_ignore_proposal
raft: fsm_test: test_leader_transfer_lost_force_vote_request
raft: fsm_test: test_leader_transfer_lost_timeout_now
raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just a matter of accepting it in the grammar.
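With this, a statement like `DELETE FROM ks.tbl USING TIMEOUT 50ms WHERE pk = 0;` (table and values illustrative) is accepted and applies the per-query timeout.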
Fixes #8855
Closes #8876
This series addresses #8852 by:
* migrating to chunked_vector in view update generation code to avoid large allocations
* reducing the number of futures kept in mutate_MV, by tracking how many view updates were already sent
Combined with #8853, I was able to observe only large partition warnings in the logs for the reproducing code, without crashes, large allocations or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.
Tests: unit(release)
Refs #8852
Closes #8856
* github.com:scylladb/scylla:
db,view: limit the number of simultaneous view update futures
db,view: use chunked_vector for view updates
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.
This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:
- there is a quadratic algorithm in
abstract_write_response_handler::response(): it searches for a replica
and erases it. Since this happens for every replica, the searches add up
to ~N^2/2 comparisons.
- replica sets for writes always include all datacenters, while reads
usually involve just one datacenter.
So, a write to a keyspace that has 5 datacenters (e.g. 15 replicas with
RF=3 in each) will invoke 15*(15-1)/2 = 105 compares.
We could remove this by sending the index of the replica in the replica
set to the replica and asking it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operations,
which they surely are. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.
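A toy model of the response() pattern, with std::vector standing in for the small_vector used in the real handler:
```
#include <algorithm>
#include <cstdint>
#include <vector>

// Every response removes its sender from the pending-replica list, so over N
// responses the linear searches add up to ~N^2/2 comparisons. With contiguous
// storage that is still cheap for the small N involved.
using replica_id = uint32_t;

bool on_response(std::vector<replica_id>& pending, replica_id from) {
    auto it = std::find(pending.begin(), pending.end(), from);
    if (it == pending.end()) {
        return false;          // duplicate or late response
    }
    pending.erase(it);
    return pending.empty();    // all replicas answered
}
```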
Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
--task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.
before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op)
after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op)
Closes #8847
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0s, and the_merge_lock will be accessed
from multiple threads. This won't work.
To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.
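A sketch of the storage-class change only; the actual type and location of the_merge_lock may differ:
```
#include <seastar/core/semaphore.hh>

// One instance per thread, so multiple "shard 0" reactors in a single process
// no longer share -- and race on -- a single global object.
static thread_local seastar::semaphore the_merge_lock{1};
```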
Closes #8858
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch adds support for the option in advance,
from the api service level down via snapshot_ctl
to the table class and the snapshot implementation.
In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).
Refs #8725
Closes #8726
* github.com:scylladb/scylla:
test: database_test: add snapshot_skip_flush_works
api: storage_service/snapshots: support skip-flush option
snapshot: support skip_flush option
table: snapshot: add skip_flush option
api: storage_service/snapshots: add sf (skip_flush) option
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.
This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is #8869.
Refs #8869.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.
So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.
So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:
1. Two nodes with hit rates above 0.95 will be considered to have the
same hit rate, so they will get equal amount of work - even if one
has hit rate 0.98 and the other 0.99.
2. A cold node with hit rate 0.0 will get 5% of the work of a node with
the perfect hit rate limited to 0.95. This will allow the cold node to
slowly warm up its cache. Before this patch, if the hot node happened
to have a hit rate of 0.999 (the previous maximum), the cold node would
get just 0.1% of the work and remain almost idle and fill its cache
extremely slowly - which is a waste.
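A small sketch of the clamping (illustrative names, not the actual HWLB code):
```
#include <algorithm>

// Requests are distributed inversely to the miss rate, and hit rates above
// the threshold are treated as equal.
constexpr float max_hit_rate = 0.95f;

float relative_share(float hit_rate) {
    float miss = 1.0f - std::min(hit_rate, max_hit_rate);   // never below 0.05
    return 1.0f / miss;
}
// With the clamp, relative_share(0.98) == relative_share(0.99) == 20, while a
// cold node (hit_rate 0.0) has share 1 -- i.e. 5% of a fully warmed-up node.
```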
Fixes #8815.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>