Commit Graph

27030 Commits

Author SHA1 Message Date
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
cbcbf648b6 storage_service: Remove write-only _force_remove_completion
This boolean became effectively unused after 829b4c14 (repair:
Make removenode safe by default)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
7396de72b1 distributed_loader: Remove unused load-prio manipulations
Most of this was removed by 6dfeb107 (distributed_loader: remove unused
code).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pekka Enberg
055bc33f0f Update tools/java submodule
* tools/java 599b2368d6...5013321823 (4):
  > cassandra-stress: fix failure due to the assert exception on disconnect when test is completed
  > node_probe: toppartitions: Fix wrong class in getMethod
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default
2021-06-18 14:19:34 +03:00
Pekka Enberg
2a9443a753 Update tools/jmx submodule
* tools/jmx a7c4c39...5311e9b (2):
  > storage_service: takeSnapshot: support the skipFlush option
  > build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient
2021-06-18 14:19:29 +03:00
Avi Kivity
b099e7c254 Merge "Untie hints managers and storage service" from Pavel E
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.

This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.

tests: unit(dev),
       dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"

* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
  hints: Drop storage service from managers
  hints: Do not subscribe managers on lifecycle events directly
2021-06-17 17:12:31 +03:00
Nadav Har'El
a9b383f423 cql-pytest: improve test for SSL/TLS versions
The existing test_ssl.py which tests for Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.

So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.

With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.

Moreover, after issue #8827 was already fixed, this test now passes,
so the "xfail" mark is removed.

Refs #8837.
Refs #8827.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
2021-06-17 17:06:31 +03:00
Nadav Har'El
8f107ece9f Update seastar submodule
* seastar 813eee3e...0e48ba88 (5):
  > net/tls: on TLS handshake failure, send error to client
  > net/dns: fix build on gcc 11
  > core: fix docstring for max_concurrent_for_each
  > test: alien_test: replace deprecated call to alien::submit_to() with new variant
  > alien: prepare for multi-instance use

The fix "net/tls: on TLS handshake failure, send error to client"
fixes #8827.

The test

    test/cql-pytest/run --ssl test_ssl.py

now xpasses, so I'll remove the "xfail" mark in a followup patch.
2021-06-17 16:24:57 +03:00
Pavel Emelyanov
92a4278cd1 hints: Drop storage service from managers
The storage service pointer is only used to (un)subscribe
to (from) lifecycle events. Now that the subscription is gone,
the storage service pointer can go as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:09:36 +03:00
Pavel Emelyanov
acdc568ecf hints: Do not subscribe managers on lifecycle events directly
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:06:26 +03:00
Tomasz Grabiec
6d8440fe70 Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov
The patch set introduces a few leadership transfer tests, some of them
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).

Others test different scenarios ensuring that pending leadership
transfer doesn't disrupt the rest of the cluster from progressing:

Lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as
lost `vote_request(force)` from the new candidate
(`test_leader_transfer_lost_force_vote_request`) don't impact the
subsequent election process, and the leader is elected as normal.

* manmanson/leadership_transfer_tests_v3:
  raft: etcd_test: test_transfer_non_member
  raft: etcd_test: test_leader_transfer_ignore_proposal
  raft: fsm_test: test_leader_transfer_lost_force_vote_request
  raft: fsm_test: test_leader_transfer_lost_timeout_now
  raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
2021-06-17 13:58:31 +02:00
Piotr Sarna
8cca68de75 cql3: add USING TIMEOUT support for deletes
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just a matter of accepting it in the grammar.

Fixes #8855

Closes #8876
2021-06-17 14:21:01 +03:00
Nadav Har'El
45c2442f49 Merge 'Avoid large allocs in mv update code' from Piotr Sarna
This series addresses #8852 by:
 * migrating to chunked_vector in view update generation code to avoid large allocations
 * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent

Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.

Tests: unit(release)
Refs  #8852

Closes #8856

* github.com:scylladb/scylla:
  db,view: limit the number of simultaneous view update futures
  db,view: use chunked_vector for view updates
2021-06-17 14:01:38 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
= 105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely are. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.

Closes #8858
2021-06-17 13:41:11 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
7fd7e90213 cql-pytest: translate Cassandra's tests for static columns
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.

This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:

Refs #8869.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
2021-06-17 11:08:28 +02:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node with
   the perfect hit rate limited to 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
Piotr Sarna
1fb831c8c1 db,view: limit the number of simultaneous view update futures
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
2021-06-17 10:20:52 +02:00
Piotr Sarna
a7f7716ecf db,view: use chunked_vector for view updates
The number of view updates can grow large, especially in corner
cases like removing large base partitions. Chunked vector
prevents large allocations.
2021-06-17 10:15:17 +02:00
Avi Kivity
3c21833aac cql3: expr: make column_value (and similar) a first-class expression
Currently, column names can only appear in a boolean binary expression,
but not on their own. This means that in the statement

   SELECT a FROM tab WHERE a > 3;

We can represent the WHERE clause as an expression, but not the selector.

To pave the way for using expressions in selector contexts, we promote
the elements of binary_operator::lhs (column_value, column_value_tuple,
token) to be expressions in their own right. binary_operator::lhs
becomes an expression (wrapped in unique_ptr, because variants can't
contain themselves).

Note that all three new possibilities make sense in a selector:

  SELECT column FROM tab
  SELECT token(pk) FROM tab
  SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab

There is some fallout from this:

 - because binary_operator contains a unique_ptr, it is no longer
   copyable. We add a copy constructor and assignment operator to
   compensate.
 - often, the new elements don't make sense when evaluating a boolean
   expression, which is the only context we had before. We call
    on_internal_error in these cases. The parser right now prevents such
   cases from being constructed in the first place (this is equivalent to
   if (some_struct_value) in C).
 - in statement_restrictions.cc, we need to evaluate the lhs in the context
   of the full binary operator. I introduced with_current_binary_operator()
   for this; an alternative approach is to create a new sub-visitor.

Closes #8797
2021-06-17 10:08:58 +03:00
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Asias He
7a32cab524 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862
2021-06-16 21:16:26 +02:00
Konstantin Osipov
18e3fcdbf1 raft: (service) servers map is gid -> server, not sid -> server
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.

Raft Group 0 ID works as a cluster identifier, so is generated when a
new cluster is created and is shared by all nodes of the same cluster.

Implement a helper to get raft::server by group id.

Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
2021-06-16 19:05:50 +03:00
Avi Kivity
f05ddf0967 Merge "Improve LSA descriptor encoding" from Pavel
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.

Nowadays it's one of VLE variants implemented with the help of a
loop. Re-implementing this piece with less instructions and without
a loop allows greatly reducing the allocation latency.

The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than byte-
-by-byte.

Perf measurements:

1. (new) logallog test shows ~40% smaller times

2. perf_mutation in release mode shows ~2% increase in tps

3. the encoder itself is 2 - 4 times faster on x86_64 and
   1.05 - 3 times faster on aarch64. The speed-up depends on
   the 'encoded length', old encoder has linear time, the
   new one is constant

tests: unit(dev), perf(release), just encoder on Aarch64
"

* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
  lsa: Use express encoder
  uleb64: Add express encoding
  lsa: Extract uleb64 code into header
  test: LSA allocation perf test
2021-06-16 18:07:13 +03:00
Pavel Emelyanov
8d0780fb92 lsa: Use express encoder
To make it possible to use the express encoder, lsa needs to
make sure that the value is below the express supreme value and
provide the size of the gap after the encoded value.

Both requirements can be satisfied when encoding the migrator
index on object allocation.

On free the encoded value can be larger; the extended
express encoder would need more instructions and would not be
as efficient, so the old encoder is used there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:12 +03:00
Pavel Emelyanov
1782b0c6b9 uleb64: Add express encoding
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses the uleb64 encoder with 2
features that allow optimizing the encoder:

1. the value is migrator.index() which is small enough
   to fit 2 bytes when encoded
2. After the descriptor there usually comes an object
   which is of 8+ bytes in size

Feature #1 makes it possible to encode the value with just
a few instructions. In O3 level clang makes it like

  mov    %esi,%ecx
  and    $0xfc0,%ecx
  and    $0x3f,%esi
  lea    (%rsi,%rcx,4),%ecx
  add    $0x40,%ecx

Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value gets off the needed size and overwrites some
memory after. That's OK, as this overwritten memory is where
the allocated object _will_ be, so the contents there are not
of any interest.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:10 +03:00
Pavel Emelyanov
d8dea48248 lsa: Extract uleb64 code into header
The LSA code encodes an object descriptor before the object
itself. The descriptor is 32-bit value and to put it in an
efficient manner it's encoded into unsigned little-endian
base-64 sequence.

The encoding code is going to be optimized, so put it into a
dedicated header in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:46:44 +03:00
Avi Kivity
0948908502 Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond
"
The close path of the multishard combining reader is riddled with
workarounds for the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed once no longer needed, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy and
the semaphores created by it, can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.

Tests: unit(dev, release:v2, debug:v2,
    mutation_reader_test:debug -t test_multishard,
    multishard_mutation_query_test:debug,
    multishard_combining_reader_as_mutation_source:debug)
"

* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy: remove convenience methods
  mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
  test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
  test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
  test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
  test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
  reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
  test/lib/reader_lifcecycle_policy: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
  reader_lifecycle_policy implementations: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
  mutation_reader: shard_reader::close(): wait on the remote reader
  multishard_mutation_query: destroy remote parts in the foreground
  mutation_reader: shard_reader::close(): close _reader
  mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
2021-06-16 17:25:50 +03:00
Konstantin Osipov
9c93d77e74 system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
Fix a bug in the definitions of system.raft and system.raft_snapshots:
group_id is TIMEUUID, not long.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
c67c77ed03 raft: (server) wait for configuration transition to complete
By default, wait for the server to leave the joint configuration
when making a configuration change.

When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().

Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.

So let's be nice and wait for a full transition in
server::set_configuration().
2021-06-16 16:52:43 +03:00
Konstantin Osipov
631c89e1a6 raft: (server) implement raft::server::get_configuration()
raft::server::set_configuration() is useless on
application level if we can't query the previous configuration.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
867440f080 raft: (service) don't throw from schema state machine
It's now started as Scylla starts, and state machine failure
leads to panic at start.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
845ff9f344 raft: (service) permit some scylla.raft cells to be empty
When loading raft state from scylla.raft, permit some cells
to be empty. Indeed, the server is not obliged to persist
all vote, term, snapshot once it starts. And the log can be
empty.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
b8fa6c6e9c raft: (service) properly handle failure to add a server
future.get() is not available outside thread context
and co_await is not available inside a catch (...) block.
2021-06-16 16:47:11 +03:00
Konstantin Osipov
73c59865f7 raft: implement is_transient_error()
Add a helper to classify Raft exceptions as transient.
2021-06-16 16:26:31 +03:00
Pavel Emelyanov
1e67361267 test: LSA allocation perf test
The test measures the time it takes to allocate a bunch
of small objects on LSA inside single segment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 13:40:44 +03:00
Botond Dénes
b4e69cf63d test/lib/test_utils: require(): also log failed conditions
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs on where exactly it happened, compared
to other logged events. This patch also adds an error-level log line to
address this.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
2021-06-16 12:05:25 +03:00
Botond Dénes
28c2b54875 mutation_reader: reader_lifecycle_policy: remove convenience methods
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without losing readability.
2021-06-16 11:29:37 +03:00
Botond Dénes
63f0839164 mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
No need for a shared pointer anymore, as we don't have to potentially
keep the shard reader alive after the multishard reader is destroyed, we
now do proper cleanup in close().
We still need a pointer as the shard reader is un-movable but is stored
in a vector which requires movable values.
2021-06-16 11:29:37 +03:00
Botond Dénes
a69db31b5c test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed as its life-cycle is effectively
enclosed in that of the life-cycle policy itself, we can cleanup the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
2021-06-16 11:29:36 +03:00
Botond Dénes
d2ddaced4e test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep various resources alive until background reader cleanup finishes.
2021-06-16 11:29:36 +03:00
Botond Dénes
5a271e42a5 test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear point at which it is safe to destroy it.
2021-06-16 11:29:36 +03:00
Botond Dénes
c09c62a0fb test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't know how this even works (or if it does at all),
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allows a single read
to be admitted at all times.
2021-06-16 11:29:36 +03:00
Botond Dénes
578a092e4a reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
To prevent use-after-free resulting from any permit out-living the
semaphore.
2021-06-16 11:29:36 +03:00
Botond Dénes
a10a6e253e test/lib/reader_lifcecycle_policy: fix indentation
Left broken from the previous patch.
2021-06-16 11:29:36 +03:00
Botond Dénes
8c7447effd mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
2021-06-16 11:29:35 +03:00
Avi Kivity
c3838cbc3b Merge 'Make calculating affected ranges yieldable' from Piotr Sarna
This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism.
This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function which can potentially involve large computations and makes it use non-contiguous storage instead of std::vector to avoid large allocations.

Tests: unit(release)

Closes #8853

* github.com:scylladb/scylla:
  db,view,table: futurize calculating affected ranges
  table: coroutinize do_push_view_replica_updates
  db,view: use chunked vector for view affected ranges
  interval: generalize deoverlap()
2021-06-16 11:26:49 +03:00
Botond Dénes
4ecf061c90 reader_lifecycle_policy implementations: fix indentation
Left broken from the previous patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
a7e59d3e2c mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.

A consequence of this move is that errors that might happen
during the stopping of the reader are now handled in the shard reader,
not in every lifecycle policy implementation.
2021-06-16 11:21:38 +03:00