* tools/java 599b2368d6...5013321823 (4):
> cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
> node_probe: toppartitions: Fix wrong class in getMethod
> Fix NullPointerException in SettingsMode
> cassandra-stress: Remove maxPendingPerConnection default
* tools/jmx a7c4c39...5311e9b (2):
> storage_service: takeSnapshot: support the skipFlush option
> build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.
This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.
tests: unit(dev),
dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"
* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
hints: Drop storage service from managers
hints: Do not subscribe managers on lifecycle events directly
The existing test_ssl.py which tests for Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite it's name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.
So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.
With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.
Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.
Refs #8837.
Refs #8827.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
* seastar 813eee3e...0e48ba88 (5):
> net/tls: on TLS handshake failure, send error to client
> net/dns: fix build on gcc 11
> core: fix docstring for max_concurrent_for_each
> test: alien_test: replace deprecated call to alien::submit_to() with new variant
> alien: prepare for multi-instance use
The fix "net/tls: on TLS handshake failure, send error to client"
fixes#8827.
The test
test/cql-pytest/run --ssl test_ssl.py
now xpasses, so I'll remove the "xfail" mark in a followup patch.
The storage service pointer is only used so (un)subscribe
to (from) lifecycle events. Now the subscription is gone,
so can the storage service pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).
Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:
Lost `timeout_now` messages` (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as
lost `vote_request(force)` from the new candidate
(test_leader_transfer_lost_force_vote_request) don't impact the
election process following that and the leader is elected as normal.
* manmanson/leadership_transfer_tests_v3:
raft: etcd_test: test_transfer_non_member
raft: etcd_test: test_leader_transfer_ignore_proposal
raft: fsm_test: test_leader_transfer_lost_force_vote_request
raft: fsm_test: test_leader_transfer_lost_timeout_now
raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just the matter of accepting it in the grammar.
Fixes#8855Closes#8876
This series addresses #8852 by:
* migrating to chunked_vector in view update generation code to avoid large allocations
* reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent
Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.
Tests: unit(release)
Refs #8852Closes#8856
* github.com:scylladb/scylla:
db,view: limit the number of simultaneous view update futures
db,view: use chunked_vector for view updates
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.
This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:
- there is a quadratic algorithm in
abstract_write_response_handler::response(): it searches for a replica
and erases it. Since this happens for every replica, it happens N^2/2
times.
- replica sets for writes always include all datacenters, while reads
usually involve just one datacenter.
So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.
We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and managent, and table rehashing.
Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
--task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.
before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op)
after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op)
Closes#8847
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.
To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.
Closes#8858
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.
In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).
Refs #8725Closes#8726
* github.com:scylladb/scylla:
test: database_test: add snapshot_skip_flush_works
api: storage_service/snapshots: support skip-flush option
snapshot: support skip_flush option
table: snapshot: add skip_flush option
api: storage_service/snapshots: add sf (skip_flush) option
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our our cql-pytest framework.
This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:
Refs #8869.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.
So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.
So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:
1. Two nodes with hit rates above 0.95 will be considered to have the
same hit rate, so they will get equal amount of work - even if one
has hit rate 0.98 and the other 0.99.
2. A cold node with it rate 0.0 will get 5% of the work of a node with
the perfect hit rate limited to 0.95. This will allow the cold node to
slowly warm up its cache. Before this patch, if the hot node happened
to have a hit rate of 0.999 (the previous maximum), the cold node would
get just 0.1% of the work and remain almost idle and fill its cache
extremely slowly - which is a waste.
Fixes#8815.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
Currently, column names can only appear in a boolean binary expression,
but not on their own. This means that in the statement
SELECT a FROM tab WHERE a > 3;
We can represent the WHERE clause as an expression, but not the selector.
To pave the way for using expressions in selector contexts, we promote
the elements of binary_operator::lhs (column_value, column_value_tuple,
token) to be expressions in their own right. binary_operator::lhs
becomes an expression (wrapped in unique_ptr, because variants can't
contain themselves).
Note that all three new possibilities make sense in a selector:
SELECT column FROM tab
SELECT token(pk) FROM tab
SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab
There is some fallout from this:
- because binary_operator contains a unique_ptr, it is no longer
copyable. We add a copy constructor and assignment operator to
compensate.
- often, the new elements don't make sense when evaluating a boolean
expression, which is the only context we had before. We call
on_internal_error in these cases. The parser right now prevents such
cases from being constructed in the first place (this is equivalent to
if (some_struct_value) in C).
- in statement_restrictions.cc, we need to evalute the lhs in the context
of the full binary operator. I introduced with_current_binary_operator()
for this; an alternative approach is to create a new sub-visitor.
Closes#8797
Miscellaneous preparatory patches for group 0 discovery.
* scylla-dev/raft-group-0-part-2-v4:
raft: (service) servers map is gid -> server, not sid -> server
system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
raft: (server) wait for configuration transition to complete
raft: (server) implement raft::server::get_configuration()
raft: (service) don't throw from schema state machine
raft: (service) permit some scylla.raft cells to be empty
raft: (service) properly handle failure to add a server
raft: implement is_transient_error()
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.
To fix, make a copy before we yield.
Fixes#8859Closes#8862
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.
Raft Group 0 ID works as a cluster identifier, so is generated when a
new cluster is created and is shared by all nodes of the same cluster.
Implement a helper to get raft::server by group id.
Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.
Nowadays it's one of VLE variants implemented with the help of a
loop. Re-implementing this piece with less instructions and without
a loop allows greatly reducing the allocation latency.
The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than byte-
-by-byte.
Perf measurements:
1. (new) logallog test shows ~40% smaller times
2. perf_mutation in release mode shows ~2% increase in tps
3. the encoder itself is 2 - 4 times faster on x86_64 and
1.05 - 3 times faster on aarch64. The speed-up depends on
the 'encoded length', old encoder has linear time, the
new one is constant
tests: unit(dev), perf(release), just encoder on Aarch64
"
* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
lsa: Use express encoder
uleb64: Add express encoding
lsa: Extract uleb64 code into header
test: LSA allocation perf test
To make it possible to use the express encoder, lsa needs to
make sure that the value is below express supreme value and
provide the size of the gap after the encoded value.
Both requirements can be satisfied when encoding the migrator
index on object allocation.
On free the encoded value can be larger, so the extended
express encoder will need more instructions and will not be
that efficient, so the old encoder is used there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses ubel64 encoder with 2
features that allow to optimize the encoder:
1. the value is migrator.index() which is small enough
to fit 2 bytes when encoded
2. After the descriptor there usually comes an object
which is of 8+ bytes in size
Feature #1 makes it possible to encode the value with just
a few instructions. In O3 level clang makes it like
mov %esi,%ecx
and $0xfc0,%ecx
and $0x3f,%esi
lea (%rsi,%rcx,4),%ecx
add $0x40,%ecx
Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value gets off the needed size and overwrites some
memory after. That's OK, as this overwritten memory is where
the allocated object _will_ be, so the contents there is not
of any interest.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The LSA code encodes an object descriptor before the object
itself. The descriptor is 32-bit value and to put it in an
efficient manner it's encoded into unsigned little-endian
base-64 sequence.
The encoding code is going to be optimized, so put it into a
dedicated header in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The close path of the multishard combining reader is riddled with
workarounds the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed after it doesn't need it, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy, the
semaphores created by it can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.
Tests: unit(dev, release:v2, debug:v2,
mutation_reader_test:debug -t test_multishard,
multishard_mutation_query_test:debug,
multishard_combining_reader_as_mutation_source:debug)
"
* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
mutation_reader: reader_lifecycle_policy: remove convenience methods
mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
test/lib/reader_lifcecycle_policy: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
reader_lifecycle_policy implementations: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
mutation_reader: shard_reader::close(): wait on the remote reader
multishard_mutation_query: destroy remote parts in the foreground
mutation_reader: shard_reader::close(): close _reader
mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
When loading raft state from scylla.raft, permit some cells
to be empty. Indeed, the server is not obliged to persist
all vote, term, snapshot once it starts. And the log can be
empty.
The test measures the time it takes to allocate a bunch
of small objects on LSA inside single segment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs on where exactly it happened, compared
to other logged events. This patchs also adds an error-level log line to
address this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without loosing readability.
No need for a shared pointer anymore, as we don't have to potentially
keep the shard reader alive after the multishard reader is destroyed, we
now do proper cleanup in close().
We still need a pointer as the shard reader is un-movable but is stored
in a vector which requires movable values.
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed as its life-cycle is effectively
enclosed in that of the life-cycle policy itself, we can cleanup the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources as long as background reader cleanup finishes.
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear time as to when it can be safe to destroy.
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this even works (or if it even does)
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allows a single read
to be admitted at all times.
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s) predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism.
This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function which can potentially involve large computations and makes it use non-contiguous storage instead of std::vector to avoid large allocations.
Tests: unit(release)
Closes#8853
* github.com:scylladb/scylla:
db,view,table: futurize calculating affected ranges
table: coroutinize do_push_view_replica_updates
db,view: use chunked vector for view affected ranges
interval: generalize deoverlap()
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping if readers.
A consequence of this move is that handling errors that might happen
during the stopping of the reader is now handled in the shard reader,
not all lifecycle policy implementations.