Commit Graph

27074 Commits

Author SHA1 Message Date
Avi Kivity
14252c8b71 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below the threshold yet a disk footprint that is over the limit, causing new segment allocation to stall.

We need to take a few things into account:
1.) We need to include wasted space in the threshold check. Whether or not the disk space is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. There is no point in waiting for the timer task.
3.) We need to adjust the thresholds a bit. Depending on sizes, we should probably start flushing once we've used up enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) We must ensure the discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up the actual file deletion/recycling.
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled the background task etc., otherwise we might fail to wake up the replenisher and get stuck in the gate.
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For the same reason as above.
7.) (v3) Signal the recycle/delete queues/promise on shutdown (with a recognized marker) to handle the edge case where we only have a single (allocating) segment in the list and cannot wake up the replenisher in a more civilized way.

Also fix an edge case (for tests) where we have too few segments to have an active one (i.e. we need to flush everything).

This is a new attempt at this fix; it should resolve the intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
a61afe8421 btree: Improve unlink_leftmost_without_rebalance()
The helper is used to walk the tree key-by-key, destroying it
in the meantime. The current implementation of this method just
uses the "regular" erasing code, which, despite the name, actually
rebalances the tree.

The biggest problem with removing the rebalancing is that at
some point a non-balanced tree may have the left-most key on an
inner node, so to make the unlink 100% rebalance-less, every other
method of the tree would have to be prepared for that. However,
there's the option of a "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key stays on a leaf.

Some more tech details. The current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough

The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.

This change removes ~60% of the key movements in a random test.
Keys still move when the sibling replacement happens, because in
that case the separation key needs to be placed at position 0 of
the right sibling, which means shifting all of its keys right.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
ab9d08d80e sstables: Remove unused filtering reader from sstable_set::make_local_shard_sstable_reader()
Table stopped accepting shared sstables a long time ago, so this
code, which creates a filtering reader if the sstable is shared, is never used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-2-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Raphael S. Carvalho
88119a5c81 distributed_loader: Kill table's _sstables_opened_but_not_loaded
_sstables_opened_but_not_loaded was needed because the old loader would
open sstables from all shards before loading them.
In the new loader, introduced with reshape, make_sstables_available()
is called on each shard after resharding and reshape have finished,
so there's no need whatsoever for that mess.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>
2021-06-24 12:03:26 +03:00
Tomasz Grabiec
ee28eb4100 Merge "test: raft: move some tests to raft folder" from Pavel Solodovnikov
Move `raft_sys_table_storage_test` and `raft_address_map_test` to
`test/raft` folder since they naturally belong here, not in
`test/boost` folder.

Tests: unit(dev)

* manmanson/move_some_raft_tests_to_raft_folder:
  test: raft: move `raft_address_map_test` to `raft` folder
  test: raft: move `raft_sys_table_storage_test` to `raft` folder
  configure: add extended raft testing dependencies
2021-06-24 12:03:26 +03:00
Pavel Emelyanov
e031e7b0a7 scylla-gdb: Do not leave random offset in smp-queues known vptrs
The process of getting a queue pointer is quite tricky here.
First, it checks whether the vptr resolves into 'vtable for async_work_item'
and puts a None mark into the known_vptrs dict.
Then, if the entry is present, there are two options. First, if it's NOT
None, it's translated directly into the queue object. But if it's None,
a loop over an offset starts that checks whether vptr + offset
maps to a queue. And here's the problem -- if no offset mapped to
any specific queue, the last checked offset is put into the known_vptrs
dict, so the next vptrs will skip the second offset check, but will go
ahead and use the "random" offset that had failed previously.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624085723.7156-1-xemul@scylladb.com>
2021-06-24 12:03:22 +03:00
Tomasz Grabiec
a60e73fe14 Merge "raft: allow to initiate leader stepdown process explicitly" from Gleb
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.

* scylla-dev/raft-stepdown-v4:
  raft: test: test leadership transfer timeout
  raft: allow to initiate leader stepdown process
2021-06-23 00:14:46 +02:00
Pavel Solodovnikov
a96ddbec35 test: raft: move raft_address_map_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:33:22 +03:00
Pavel Solodovnikov
cf5025c44e test: raft: move raft_sys_table_storage_test to raft folder
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:31:41 +03:00
Pavel Solodovnikov
6912f76e45 configure: add extended raft testing dependencies
Rename `scylla_raft_dependencies` to `scylla_minimal_raft_dependencies`
and introduce `scylla_raft_dependencies` that contains
`scylla_core` (i.e., all scylla source files).

The new `scylla_raft_dependencies` variable will be used
for `raft_address_map_test` and `raft_sys_table_storage_test`,
which use a lot of machinery from scylla.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-22 23:26:18 +03:00
Benny Halevy
4ab4f63efe sstables: mx/writer: flush_tmp_bufs: maybe_yield in loop
This loop may cause pretty long reactor stalls as seen in
https://github.com/scylladb/scylla/issues/8900

Apparently output_stream<CharType>::slow_write returns
a ready future, so no yielding happens; therefore,
add a check in the top-level loop (which must already
be called from a seastar thread).

Fixes #8900

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210622152206.156302-1-bhalevy@scylladb.com>
2021-06-22 18:56:12 +03:00
Avi Kivity
d27e88e785 Merge "compaction: prevent broken_promise or dangling reader errors" from Benny
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.

At the moment compaction just closes the reader unilaterally,
and this yanks the reader out from under the queue_reader_handle's
feet, causing dangling queue reader and broken_promise errors
as seen in #8755.

Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.

Fixes #8755

Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"

* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
  queue_reader_handle: add get_exception method
  queue_reader: close: set value on promises on detach from handle
2021-06-22 18:52:11 +03:00
Calle Wilund
373fa3fa07 table: ensure memtable is actually in memtable list before erasing
Fixes #8749

If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this
before erasing; otherwise we get random memory corruption via
std::vector::erase.

v2:
* Make the interface more set-like (tolerate non-existence in erase).

Closes #8904
2021-06-22 15:58:56 +02:00
Asias He
ffa211a8c7 repair: Avoid copy rows in apply_rows_on_master_in_thread
The rows are not used after the call to to_repair_rows_list. Use
std::move() to avoid copying.

Fixes #8902

Closes #8903
2021-06-22 15:58:56 +02:00
Benny Halevy
02917c79b6 logalloc: get rid of unused _descendant_blocked_requests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>
2021-06-22 15:58:56 +02:00
Piotr Dulikowski
de1679b1b9 hints: make hints concurrency configurable and reduce the default
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
at runtime. Additionally, the default concurrency per shard is
lowered to 8.

The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)

Refs: #8624

Closes #8646
2021-06-22 15:58:56 +02:00
Gleb Natapov
09528b8671 raft: test: test leadership transfer timeout
Test that if a leadership transfer cannot be done within the configured
time frame, the fsm cancels the leadership transfer process. Also check
that the timeout_now message is resent on each tick while a leadership
transfer is in progress.
2021-06-22 14:42:50 +03:00
Gleb Natapov
ed49d29473 raft: allow to initiate leader stepdown process
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.

We already have a mechanism to transfer leadership when an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If the node is still the leader after the timeout, the
operation fails.
2021-06-22 14:36:42 +03:00
Konstantin Osipov
bd410da77a raft: (service) rename raft_services service to raft_group_registry
This is a more informative name. It helps make clear that, say, group0
is a separate service, rather than bundling all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Konstantin Osipov
025f18325e raft: (service) move raft service to namespace service
Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Calle Wilund
fdb5801704 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment discard (using the rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing an explicit per-memtable discard call. But to
ensure this works, since a memtable being flushed remains on the
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

v3:
* Fix table::clear to discard rp_sets before memtables

Closes #8894
2021-06-21 14:53:54 +03:00
Takuya ASADA
a677c46672 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory
after uninstalling Scylla, or may want to upgrade to the enterprise version.

Also, we mixed two types of files as ghost files that should be handled
differently:
 1. automatically generated by the postinst scriptlet
 2. generated by user-invoked scylla_setup

The package should remove only type 1, since type 2 is generated by user decision.

See scylladb/scylla-enterprise#1780

Closes #8810
2021-06-21 14:53:54 +03:00
Calle Wilund
0a7823e683 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate a case where segment usage and the actual disk
footprint mismatch and we fail to flush enough to allow segment recycling.
2021-06-21 06:01:19 +00:00
Calle Wilund
954da1f0a9 commitlog_test: Improve test assertion
Change it so the actual data is printed, not just the error.
2021-06-21 06:01:19 +00:00
Calle Wilund
d6113912cd commitlog: Add waitable future for background sync/flush
The commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long, we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk even if it is
actually clean.

This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for it
to complete in shutdown, before pruning and deleting segments.
2021-06-21 06:01:19 +00:00
Benny Halevy
499357fb43 row_cache: autoupdating_underlying_reader: fast_forward_to: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-2-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Benny Halevy
3db7db5743 row_cache: autoupdating_underlying_reader: fast_forward_to: capture snapshot by value when updating reader
Currently we capture the snapshot mutation_source by reference
for calling create_underlying_reader after closing the reader.
However, if close_reader yields, the snapshot reference passed
may be gone, so capture it by value instead.

Fixes #8848

Test: unit(dev)
DTest: restore_snapshot_using_old_token_ownership_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>
2021-06-20 14:46:35 +03:00
Avi Kivity
5b3fb83ebe Merge "Remove unused code here and there" from Pavel E
"
A few randomly spotted dead-code locations collected over time.
Compile-tested only.
"

* 'br-remove-unused-stuff' of https://github.com/xemul/scylla:
  database: Remove unused forward declarations
  feature: Remove unused friendship with gossiper
  schema_tables: Remove unused sharded<proxy> argument
  database: Remove few unused sharded<proxy> captures
  view_update_generator: Remove unused struct sstable_with_table
  storage_service: Remove write-only _force_remove_completion
  distributed_loader: Remove unused load-prio manipulations
2021-06-20 12:01:40 +03:00
Pavel Emelyanov
ab4fc41f25 database: Remove unused forward declarations
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
d606321575 feature: Remove unused friendship with gossiper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
96131349e8 schema_tables: Remove unused sharded<proxy> argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
0f36f00682 database: Remove few unused sharded<proxy> captures
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
cbcbf648b6 storage_service: Remove write-only _force_remove_completion
This boolean became effectively unused after 829b4c14 (repair:
Make removenode safe by default)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
7396de72b1 distributed_loader: Remove unused load-prio manipulations
Most of this was already removed by 6dfeb107 (distributed_loader: remove
unused code).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pekka Enberg
055bc33f0f Update tools/java submodule
* tools/java 599b2368d6...5013321823 (4):
  > cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
  > node_probe: toppartitions: Fix wrong class in getMethod
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default
2021-06-18 14:19:34 +03:00
Pekka Enberg
2a9443a753 Update tools/jmx submodule
* tools/jmx a7c4c39...5311e9b (2):
  > storage_service: takeSnapshot: support the skipFlush option
  > build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
2021-06-18 14:19:29 +03:00
Avi Kivity
b099e7c254 Merge "Untie hints managers and storage service" from Pavel E
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.

This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.

tests: unit(dev),
       dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"

* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
  hints: Drop storage service from managers
  hints: Do not subscribe managers on lifecycle events directly
2021-06-17 17:12:31 +03:00
Nadav Har'El
a9b383f423 cql-pytest: improve test for SSL/TLS versions
The existing test_ssl.py, which tests Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.

So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.

With this new API, we can correctly connect with various versions of
the SSL and TLS protocols - from SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.

Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.

Refs #8837.
Refs #8827.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
2021-06-17 17:06:31 +03:00
Nadav Har'El
8f107ece9f Update seastar submodule
* seastar 813eee3e...0e48ba88 (5):
  > net/tls: on TLS handshake failure, send error to client
  > net/dns: fix build on gcc 11
  > core: fix docstring for max_concurrent_for_each
  > test: alien_test: replace deprecated call to alien::submit_to() with new variant
  > alien: prepare for multi-instance use

The fix "net/tls: on TLS handshake failure, send error to client"
fixes #8827.

The test

    test/cql-pytest/run --ssl test_ssl.py

now xpasses, so I'll remove the "xfail" mark in a followup patch.
2021-06-17 16:24:57 +03:00
Pavel Emelyanov
92a4278cd1 hints: Drop storage service from managers
The storage service pointer was only used to (un)subscribe
to (from) lifecycle events. Now that the subscription is gone,
the storage service pointer can go too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:09:36 +03:00
Pavel Emelyanov
acdc568ecf hints: Do not subscribe managers on lifecycle events directly
The managers sit on the storage proxy, which is already subscribed
to lifecycle events, so it can "notify" the hints managers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:06:26 +03:00
Tomasz Grabiec
6d8440fe70 Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).

Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:

Lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`), as well as a
lost `vote_request(force)` from the new candidate
(`test_leader_transfer_lost_force_vote_request`), don't impact the
election process that follows, and the leader is elected as normal.

* manmanson/leadership_transfer_tests_v3:
  raft: etcd_test: test_transfer_non_member
  raft: etcd_test: test_leader_transfer_ignore_proposal
  raft: fsm_test: test_leader_transfer_lost_force_vote_request
  raft: fsm_test: test_leader_transfer_lost_timeout_now
  raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
2021-06-17 13:58:31 +02:00
Piotr Sarna
8cca68de75 cql3: add USING TIMEOUT support for deletes
It turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just a matter of accepting it in the grammar.

Fixes #8855

Closes #8876
2021-06-17 14:21:01 +03:00
Nadav Har'El
45c2442f49 Merge 'Avoid large allocs in mv update code' from Piotr Sarna
This series addresses #8852 by:
 * migrating to chunked_vector in view update generation code to avoid large allocations
 * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent

Combined with #8853, I was able to observe only large partition warnings in the logs for the reproducing code, without crashes, large allocations or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.

Tests: unit(release)
Refs  #8852

Closes #8856

* github.com:scylladb/scylla:
  db,view: limit the number of simultaneous view update futures
  db,view: use chunked_vector for view updates
2021-06-17 14:01:38 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and asking it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely are. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.

Closes #8858
2021-06-17 13:41:11 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / –skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
7fd7e90213 cql-pytest: translate Cassandra's tests for static columns
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.

This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:

Refs #8869.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
2021-06-17 11:08:28 +02:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node with
   a perfect hit rate capped at 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00