The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources alive until background reader cleanup finishes.
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear time as to when it can be safe to destroy.
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this even works (or if it even does)
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single read
to be admitted at all times.
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.
A consequence of this move is that errors that might happen
during the stopping of the reader are now handled in the shard reader,
not in all lifecycle policy implementations.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We use this new functionality to tick Raft servers less often than the
network in basic_test.
This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. That approach is unfortunately incompatible with the
one taken here. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
* kbr/tick-network-often-v4:
raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
raft: randomized_nemesis_test: split `environment::tick` into two functions
raft: randomized_nemesis_test: fix potential use-after-free in basic_test
... with associated calling periods and use the new API in `basic_test`.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).
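The period semantics described above can be illustrated with a small sketch. It is written in Python rather than the C++ of the real `ticker`, and the `run_ticker` name, its signature and the choice to start counting at tick 1 are assumptions made for illustration only. It shows how `{n, f}` pairs translate into calls, using the 1:10 network-to-server ratio mentioned above.

```python
from typing import Callable, Iterable, List, Tuple


def run_ticker(periodic: Iterable[Tuple[int, Callable[[int], None]]],
               ticks: int) -> None:
    """Call each function f on every n-th tick (hypothetical helper)."""
    pairs = list(periodic)
    for tick in range(1, ticks + 1):
        for period, fn in pairs:
            if tick % period == 0:
                fn(tick)
        # The real ticker would yield to the reactor here, between ticks.


network_ticks: List[int] = []
server_ticks: List[int] = []

# The network is ticked on every tick, Raft servers once per 10 ticks.
run_ticker([(1, network_ticks.append), (10, server_ticks.append)], ticks=30)
```

With 30 ticks the network callback runs 30 times while the server callback runs only on ticks 10, 20 and 30.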
This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. That approach is unfortunately incompatible with the
one taken here. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.
If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.
Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.
We fix the problems by:
- making `environment` `weakly_referencable` and checking if it's alive
before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
function arguments (so these things get allocated inside the coroutine
frame).
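The first fix can be sketched by analogy in Python, where `weakref` plays the role that `weakly_referencable` plays in the C++ test. The `Environment` class and `make_leader_check` helper below are hypothetical stand-ins, not the test's actual code. The second fix (avoiding lambda captures in a coroutine) is specific to C++ coroutines and is not modelled here.

```python
import gc
import weakref


class Environment:
    """Hypothetical stand-in for the test's `environment`."""

    def __init__(self) -> None:
        self.ticks = 0

    def tick(self) -> None:
        self.ticks += 1


def make_leader_check(target: Environment):
    # Keep only a weak reference, so the task cannot keep using the
    # environment after it has been destroyed.
    target_ref = weakref.ref(target)

    def check() -> bool:
        live = target_ref()
        if live is None:
            # The environment is gone: stop instead of touching it.
            return False
        live.tick()
        return True

    return check


env = Environment()
check = make_leader_check(env)
alive_before = check()

del env        # the environment is destroyed while the task still exists
gc.collect()   # make collection deterministic on non-refcounting runtimes
alive_after = check()
```

The task notices that the environment no longer exists and bails out rather than dereferencing it.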
Merged patch series by Pavel Emelyanov:
Alternator start and stop code is sitting inside the main()
and it's a big piece of code out there. Having it all in main
complicates rework of start-stop sequences, it's much more
handy to have it in alternator/.
This set puts the mentioned code into transport- and thrift-
like controller model. While doing it one more call for global
storage service goes away.
* 'br-alternator-clientize' of https://github.com/xemul/scylla:
alternator: Move start-stop code into controller
alternator: Move the whole starting code into a sched group
alternator: Dont capture db, use cfg
alternator: Controller skeleton
alternator: Controller basement
alternator: Drop storage service from executor
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.
Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling
Tests: unit ({dev}), unit ({debug}), unit ({release})
Changes in v2:
- Fixed commit message @kostja
Without prevote, a node disconnected for long enough becomes candidate.
While disconnected (A) it keeps increasing its term.
When it rejoins it disrupts the current leader (C) which steps down due
to the higher term in (A)'s append_entries_reply and (C) also increases
its term.
Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has a shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).
Then (C) rejects voting for (A) because (A) has a shorter log.
And (C) becomes candidate but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and election timeout has not passed.
(A) and (C) alone (2 of 4 nodes) can't reach quorum. So elections never succeed.
This patch addresses this problem by making followers not ignore vote
requests from the node they believe is the current leader, even if the
election timeout has not been reached.
As @kostja noted, if the failure detector considered a leader alive only
as long as it sends heartbeats (append requests), this patch would no longer
be needed.
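The rule change can be summarised with a minimal sketch. The `FollowerView` class and its fields are hypothetical and do not mirror the actual Raft fsm code. The sketch only models whether a vote request is looked at, not whether the vote is granted, which still depends on the usual term and log checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerView:
    """What a follower currently believes about the cluster (hypothetical)."""
    current_leader: Optional[str]
    leader_seems_alive: bool
    election_timeout_elapsed: bool

    def considers_vote_request(self, candidate: str) -> bool:
        # Raft 4.2.3: while a live leader is believed to exist and the
        # election timeout has not passed, vote requests are ignored.
        if self.election_timeout_elapsed or not self.leader_seems_alive:
            return True
        # Exception described above: a request from the node this follower
        # thinks is the leader implies that the leader has stepped down.
        return candidate == self.current_leader


# Follower (B) still believes (C) is the live leader.
b = FollowerView(current_leader="C", leader_seems_alive=True,
                 election_timeout_elapsed=False)
ignored_a = not b.considers_vote_request("A")
considered_c = b.considers_vote_request("C")
```

In the scenario above, (B) keeps ignoring the disruptive (A) but now processes the request from (C), so (C) can gather votes from (A), (B) and (D).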
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in partitioned set which came into existence after
compound set was introduced.
The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if next position is freed while it is being used
as current position, subsequent selectors would find the current
position freed, making them produce incorrect results.
Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.
Fixes #8802.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.
* scylla-dev/raft-learner-test-v4:
raft: (testing) test non-voter can vote
raft: (testing) test receiving a confchange in a snapshot
raft: (testing) test voter-non-voter config change loop
raft: (testing) test non-voter doesn't start election on election timeout
raft: (testing) test what happens when a learner gets TimeoutNow
raft: (testing) implement a test for a leader becoming non-voter
raft: style fix
raft: step down as a leader if converted to a non-voter
raft: improve configuration consistency checks
raft: (testing) test that non-voter stays in PIPELINE mode
raft: (testing) always return fsm_debug in create_follower()
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.
Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.
The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).
Refs #8837
Refs #8827
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
When a non-voter is asked for a vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting with their current configuration,
and the non-voter may not have the latest configuration
when it is requested to vote.
Once a learner receives TimeoutNow it becomes a candidate, discovers it
can't vote, doesn't increase its term and converts back to a
follower. Once entries arrive from a new leader it updates its
term.
Recently, the logic of elect_new_leader was changed to allow the old
leader to vote for the new candidate. But the implementation is wrong as
it re-connects the old leader in all cases, regardless of whether the nodes
were already disconnected.
Check if both old leader and the requested new leader are connected
first and only if it is the case then the old leader can participate in
the election.
There were occasional hangs in the loop of elect_new_leader because
other nodes besides the candidate were ticked. This patch fixes the
loop by removing ticks inside of it.
The loop is needed to handle prevote corner cases (e.g. 2 nodes).
While there, also wait for the log on all followers to prevent a previously
dropped leader from becoming a dueling candidate.
And update _leader only if it was changed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-3-alejo.sanchez@scylladb.com>
The query tracing tests in test/alternator's test_tracing.py had one
timeout of 30 seconds to find the trace, and one unclearly-coded timeout
for finding the right content for the trace. We recently saw both
timeouts exceeded in tests, but only rarely and only in debug mode,
in a run 100 times slower than normal.
This patch increases both timeouts to 100 seconds. Whatever happens then,
we win: If the test stops failing, we know the new timeout was enough.
If the test continues to fail, we will be able to conclude that we have a
real bug - e.g., perhaps one of the LWT operations has a bug causing it
to hang indefinitely.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210608205026.1600037-1-nyh@scylladb.com>
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
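A minimal sketch of the kind of check described above, in Python for brevity. The exception class and function names are hypothetical, and plain integers stand in for the real key and token types.

```python
class MalformedSSTableError(Exception):
    """Hypothetical analogue of malformed_sstable_exception."""


def validate_first_and_last(first_token: int, last_token: int) -> None:
    # The keys are well ordered when first <= last. Equal keys are fine:
    # an sstable may hold a single partition.
    if first_token > last_token:
        raise MalformedSSTableError(
            f"first token {first_token} is greater than last token {last_token}")


validate_first_and_last(10, 20)   # well ordered
validate_first_and_last(7, 7)     # single-key sstable

try:
    validate_first_and_last(20, 10)
    rejected = False
except MalformedSSTableError:
    rejected = True
```

Failing early at the point where the keys are set makes the offending sstable visible long before a later assert trips on it.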
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.
Refs #8772
Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
* tag 'validate-first-and-last-keys-ordering-v1':
sstable: validate first and last keys ordering
test: lib: reusable_sst: save unexpected errors
test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
test: sstable_test: define primary key in schema for compressed sstable
* scylla-dev/raft-group-0-part-1-rebase:
raft: (service) pass Raft service into storage_service
raft: (service) add comments for boot steps
raft: add ordering for raft::server_address based on id
raft: (internal) simplify construction of tagged_id
raft: (internal) tagged_id minor improvements
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.
Since the Raft service depends on the query processor, and the query
processor depends on the storage service, break the dependency
loop by splitting Raft initialization into two steps: first, start
an under-constructed instance of the "sharded" Raft service,
which accepts an under-constructed instance of the "sharded"
query_processor and is then passed into the storage service start
function; second, load the local state of Raft groups from system
tables once the query processor starts.
Consistently abbreviate raft_services instance raft_svcs, as
is the convention at Scylla.
Update the tests.
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.
Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
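The idea of a default ordering used to pick the smallest id can be sketched as follows. This is a Python analogy, not the C++ `tagged_id`. The `ServerId` class is hypothetical, and the way Python orders UUIDs is not necessarily the ordering the real implementation uses.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ServerId:
    """Hypothetical analogue of a tagged id wrapping a UUID."""
    id: uuid.UUID

    @staticmethod
    def create_random_id() -> "ServerId":
        return ServerId(uuid.uuid4())


# Every node sees the same set of ids, so with a total ordering they all
# independently select the same one.
ids = [ServerId.create_random_id() for _ in range(5)]
leader = min(ids)
```

Because the ordering is total and deterministic, no extra coordination is needed for the nodes to agree on the choice.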
This patch adds a "--ssl" option to test/cql-pytest's pytest, as well as
to the run script test/cql-pytest/run. When "test/cql-pytest/run --ssl"
is used, Scylla is started listening for encrypted connections on its
standard port (9042) - using a temporary unsigned certificate. Then, the
individual tests connect to this encrypted port using TLSv1.2 (Scylla
doesn't support earlier versions of SSL) instead of TCP.
This "--ssl" feature allows writing tests which stress various aspects of
the connection (e.g., oversized requests - see PR #8800), and then be
able to run those tests in both TCP and SSL modes.
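For illustration, a client-side TLSv1.2 context of the kind such a test could hand to its driver might be built as below with Python's standard `ssl` module. The `make_test_ssl_context` helper is hypothetical, and disabling certificate verification is an assumption made here because the server uses a temporary certificate; the actual test fixture may configure this differently.

```python
import ssl


def make_test_ssl_context() -> ssl.SSLContext:
    """Build a client context pinned to TLSv1.2 (hypothetical helper)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Assumption: the temporary test certificate cannot be verified against
    # a CA, so verification is switched off. check_hostname must be
    # disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Pin the protocol version to TLSv1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_test_ssl_context()
```

A context like this is suitable only for tests against a throwaway server, never for production connections.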
Fixes #8811
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210607200329.1536234-1-nyh@scylladb.com>
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.
The fixed 4096 value can be subject to debate; its reasoning is as follows:
for 2KiB rows, which are moderately large but not huge, 4096 concurrent
fetches amount to roughly 10MB of data, which is the granularity used by
replicas. For 200B rows, which are rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.
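The sizing argument above is simple arithmetic, spelled out here as a short sketch. The constant and function names are illustrative only, and one row per concurrent fetch is assumed.

```python
CONCURRENCY_CAP = 4096


def in_flight_bytes(row_size_bytes: int) -> int:
    """Bytes fetched by one fully concurrent batch, one row per task."""
    return CONCURRENCY_CAP * row_size_bytes


moderate_rows = in_flight_bytes(2 * 1024)  # 2KiB rows
small_rows = in_flight_bytes(200)          # 200B rows

moderate_mib = moderate_rows / (1024 * 1024)
small_mb = small_rows / 1_000_000
```

For 2KiB rows this gives 8MiB, the same order of magnitude as the 10MB quoted above, and for 200B rows about 0.8MB, in line with "around 1MB".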
Fixes #8799
Tests: unit(release),
manual: observing metrics of modified index_paging_test
Closes #8814
* github.com:scylladb/scylla:
cql3: limit the transitional result size for indexed queries
cql3: return indexed pages after 1MB worth of data
cql3: limit the concurrency of indexed statements
Currently each tick of the virtual clock immediately schedules the next one
at the end of the task queue, but this is too aggressive. If a tick
generates work that needs two tasks to be scheduled one after another,
such an implementation will make the task queue grow to infinity. Considering
that in debug mode even a ready future causes preemption, and that task
queue shuffling may cause two or more ticks to be executed without any
other work done in between, it is very easy to get into such a situation.
The patch changes the virtual clock to tick only when a shard is idle.
Message-Id: <20210606140305.2930189-1-gleb@scylladb.com>
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by the number of bytes
fetched previously. Once this number reaches 1MB, the concurrency will not be
increased in consecutive queries to avoid overload.
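A minimal sketch of the resulting growth rule, assuming the concurrency doubles between iterations as described for indexed queries. The function and constant names are hypothetical, and whether "1MB" means 10^6 or 2^20 bytes in the real code is not stated here; 2^20 is assumed.

```python
MAX_CONCURRENCY = 4096
RESULT_BYTES_THRESHOLD = 1024 * 1024  # "1MB"; the exact unit is an assumption


def next_concurrency(current: int, previous_bytes_fetched: int) -> int:
    """Concurrency to use for the next iteration of an unpaged indexed query."""
    if previous_bytes_fetched >= RESULT_BYTES_THRESHOLD:
        # Enough data came back last time: do not grow any further.
        return current
    # Otherwise keep doubling, but never beyond the fixed cap.
    return min(current * 2, MAX_CONCURRENCY)


grown = next_concurrency(64, previous_bytes_fetched=10_000)
capped = next_concurrency(4096, previous_bytes_fetched=10_000)
throttled = next_concurrency(64, previous_bytes_fetched=2 * 1024 * 1024)
```

Small results let the concurrency keep doubling up to the cap, while a previous iteration that already returned 1MB freezes it at its current level.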
Feature requests, fixes, and OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided.
To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using it in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
In this small series, I rewrite test/alternator/run to Python using the utility
functions developed for test/cql-pytest. In the future, we should do the same to
test/redis/run and test/scylla-gdb/run.
The benefit of this rewrite is less code duplication (all run scripts start with
the same duplicate code to deal with temporary directories, to run Scylla, IP
addresses, etc.), but most importantly, future fixes we make to cql-pytest
(e.g., parameters needed to start Scylla efficiently, how to shut down Scylla,
etc.) will appear automatically in the alternator test without needing to remember
to change both.
Another benefit is that test/alternator/run will now be Python, not a shell
script. This should make it easier to integrate it into test.py (refs #6212) in
the future - if we want to.
Closes #8792
* github.com:scylladb/scylla:
test/alternator: rewrite test/alternator/run script in Python
test/cql-pytest: make test run code more general
Eliminate unused includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Most Raft packets are sent very rarely during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send functions does not serve any purpose. Change Raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes, since
the replication protocol is more efficient if there is no packet loss, so
it is better to pause a sender than to drop packets inside the rpc. Rpc
is still allowed to drop append_requests if overloaded.
Fixes #8773
When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).
This causes certain enterprise extensions to not be able
to disable themselves.
Fixed by filtering existing extensions by property keywords.
Unit test added.
Closes #8774
In test_tracing.py::test_slow_query_log, there was what looked like
an innocent 30-second timeout, but this was in fact an 8 minute
timeout - because it started with sleeping 1 second, then 2 seconds,
then 3, ... until 30 seconds. Such a high timeout is frustrating when
trying to debug failures in the test - which is only expected to take
2 seconds (and all of it because of an artificial timeout).
So fix the loop to stop iterating after 60 seconds (a compromise
between 30 seconds and 8 minutes...), sleeping a constant amount
between iterations.
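The "8 minutes" figure and the shape of the fix can be sketched as follows. The `wait_for` helper, its parameters and the default interval are hypothetical and not the test's actual code; the clock and sleep functions are injectable only so the sketch can be exercised without really waiting.

```python
import time


def old_total_sleep(max_sleep: int = 30) -> int:
    # The old loop slept 1s, then 2s, ... up to 30s: 1 + 2 + ... + 30.
    return sum(range(1, max_sleep + 1))


def wait_for(condition, timeout: float = 60.0, interval: float = 0.1,
             clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll `condition` at a constant interval until it holds or time is up."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


worst_case_seconds = old_total_sleep()
found_immediately = wait_for(lambda: True)
```

The old schedule adds up to 465 seconds, i.e. almost 8 minutes, whereas a deadline-based loop gives up after the stated timeout no matter how many iterations it takes.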
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210601150631.1037158-1-nyh@scylladb.com>