This series incorporates various refactorings aimed mostly
at eliminating extra parameters to `serializer_*_impl` functions
for `EnumDef` and `ClassDef` AST classes.
Instead of carrying these parameters here and there over many
places, they are calculated on a preliminary run to collect
additional metadata, such as: namespaces and template parameters
from parent scopes. This metadata is used later to extend AST
classes.
The patchset does not introduce any changes in the generation
procedures, exclusively dealing with internal code structuring.
NOTE: metadata collection currently involves an extra run through
the parse tree; the proper way would be to populate it directly
while parsing the input. This is left to be adjusted later in a
follow-up series.
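As an illustration of the preliminary pass described above, here is a minimal sketch in C++. It is not the IDL compiler's own code: `ast_node`, `node_kind` and `cache_namespaces` are hypothetical names, and only namespace caching is shown (template parameters would be collected the same way).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified AST; not the IDL compiler's real classes.
enum class node_kind { ns, class_def, enum_def };

struct ast_node {
    node_kind kind = node_kind::ns;
    std::string name;
    std::vector<std::unique_ptr<ast_node>> children;
    std::vector<std::string> namespaces;  // cached by the preliminary pass

    // With the namespaces cached on the node, no extra argument is needed.
    std::string ns_qualified_name() const {
        std::string out;
        for (const auto& n : namespaces) {
            out += n + "::";
        }
        return out + name;
    }
};

// Preliminary pass: record the enclosing namespaces on every node.
void cache_namespaces(ast_node& node, std::vector<std::string> enclosing = {}) {
    node.namespaces = enclosing;
    if (node.kind == node_kind::ns) {
        enclosing.push_back(node.name);
    }
    for (auto& child : node.children) {
        cache_namespaces(*child, enclosing);
    }
}
```

After `cache_namespaces(root)` has run, later generation steps can call `ns_qualified_name()` on any node without threading a `namespaces` argument through every call.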
Closes #8148
* github.com:scylladb/scylla:
idl: add descriptions for the top-level generation routines
idl: make ns_qualified name a class method
idl: cache template declarations inside enums and classes
idl: cache parent template params for enums and classes
idl: rename misleading `local_types` to `local_writable_types`
idl: remove remaining uses of `namespaces` argument
idl: remove `is_final` function and use `.final` AST class property
idl: remove `parent_template_param` from `local_types` set
idl: cache namespaces in AST nodes
idl: remove unused variables
"
This series moves the timeout parameter, which is passed to most
flat_mutation_reader (f_m_r) methods, into the reader_permit. This
eliminates the need to pass the timeout around, as it's taken
from the permit when needed.
The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.
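The idea can be sketched as follows. This is a simplified illustration, not the actual reader_permit or flat_mutation_reader code: `permit_sketch` and `reader_sketch` are hypothetical stand-ins, and the current time is passed in explicitly only to keep the example deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>

using clock_type = std::chrono::steady_clock;
using time_point = clock_type::time_point;

// Hypothetical stand-in for reader_permit: the permit owns the deadline.
class permit_sketch {
    time_point _timeout = time_point::max();  // "no timeout" by default
public:
    time_point timeout() const { return _timeout; }
    // Called, for example, when a paused reader is picked up for reuse.
    void set_timeout(time_point t) { _timeout = t; }
};

// Hypothetical reader: fill_buffer() has no timeout argument and
// consults the permit instead.
class reader_sketch {
    permit_sketch& _permit;
public:
    explicit reader_sketch(permit_sketch& p) : _permit(p) {}

    // Returns true if work may proceed; throws once the deadline has passed.
    bool fill_buffer(time_point now) {
        if (now >= _permit.timeout()) {
            throw std::runtime_error("timed out");
        }
        return true;
    }
};
```

Because the deadline lives in one place, updating it on the permit is enough for every later call made through that reader.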
Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.
$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10
Before:
102500.38 tps ( 75.1 allocs/op, 12.1 tasks/op, 45620 insns/op)
After:
103957.53 tps ( 75.1 allocs/op, 12.1 tasks/op, 45372 insns/op)
Test: unit(dev)
DTest:
repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"
* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
flat_mutation_reader: get rid of timeout parameter
reader_concurrency_semaphore: use permit timeout for admission
reader_concurrency_semaphore: adjust reactivated reader timeout
multishard_mutation_query: create_reader: validate saved reader permit
repair: row_level: read_mutation_fragment: set reader timeout
flat_mutation_reader: maybe_timed_out: use permit timeout
test: sstable_datafile_test: add sstable_reader_with_timeout
reader_permit: add timeout member
With data segregation on repair, thousands of sstables are potentially
added to maintenance set which causes high latency due to stalls.
That's because N*M sstables are created by a repair,
where N = # of ranges
and M = # of segregations
For TWCS, M = # of windows.
Assuming N = 768 and M = 20, ~15k sstables end up in sstable set
To fix this problem, let's avoid performing data segregation in repair,
as offstrategy will already perform the segregation anyway.
So from now on, only N non-overlapping sstables will be added to set.
Read amplification isn't affected because a query will only touch one
sstable in maintenance set.
When offstrategy starts, it will pick all sstables from set and
compact them in a single step while performing data segregation,
so data is properly laid out before integrated into the main set.
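The arithmetic above can be checked with a small sketch. The function names are made up for illustration; the figures N = 768 and M = 20 are the ones assumed in the text.

```cpp
#include <cassert>
#include <cstdint>

// Sstables added to the maintenance set by one repair when each of the
// `ranges` ranges is split into `segregations` pieces (N * M).
constexpr std::uint64_t sstables_with_segregation(std::uint64_t ranges,
                                                  std::uint64_t segregations) {
    return ranges * segregations;
}

// Without segregation in repair: one sstable per range (N).
constexpr std::uint64_t sstables_without_segregation(std::uint64_t ranges) {
    return ranges;
}

// N = 768 ranges, M = 20 TWCS windows.
static_assert(sstables_with_segregation(768, 20) == 15360, "~15k sstables");
static_assert(sstables_without_segregation(768) == 768, "only N sstables");
```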
tests:
- sstable_compaction_test.twcs_reshape_with_disjoint_set_test
- mode(dev)
- manual test using repair-based bootstrap
Fixes #9199.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210824185043.76475-1-raphaelsc@scylladb.com>
Current code std::move()-s the range tombstone into the consumer, thus
moving the tombstone's linkage to the containing list as well. As
a result the original range tombstone itself leaks, as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.
The immediate fix is to keep the tombstone linked to the list while
moving.
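A minimal sketch of the general idea, using a hand-rolled singly linked intrusive node rather than the real range_tombstone_list (the `entry` type and `drain_payloads` helper are hypothetical): only the payload is moved out, while the link stays in place so iteration and later cleanup still reach every node.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical intrusive node: the payload and the link live together.
struct entry {
    std::string data;
    entry* next = nullptr;  // intrusive link owned by the container
};

// Moves each payload out but never touches `next`, so the node stays
// reachable from the container and the iteration pointer stays valid.
std::vector<std::string> drain_payloads(entry* head) {
    std::vector<std::string> out;
    for (entry* it = head; it != nullptr; it = it->next) {
        out.push_back(std::move(it->data));  // payload only, link untouched
    }
    return out;
}
```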
fixes: #9207
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
"
This series implements section 6.4 of the Raft PhD. It allows
linearisable reads on a follower, bypassing the Raft log entirely. After
this series, server::read_barrier can be executed on a follower as well
as on a leader, and once it completes, the local user state machine
state can be accessed directly.
"
* 'raft-read-v9' of github.com:scylladb/scylla-dev:
raft: test: add read_barrier test to replication_test
raft: test: add read_barrier tests to fsm_test
raft: make read_barrier work on a follower as well as on a leader
raft: add a function to wait for an index to be applied
raft: (server) add a helper to wait through uncertainty period
raft: make fsm::current_leader() public
raft: add hasher for raft::internal::tagged_uint64
serialize: add serialized for std::monostate
raft: fix indentation in applier_fiber
This patch implements the Raft extension that allows linearisable
reads to be performed by accessing the local state machine. The extension
is described in section 6.4 of the PhD. To sum it up: to perform a read
barrier, a follower needs to ask the leader for the last committed index
that it knows about. The leader must make sure that it is still the leader
before answering, by communicating with a quorum. When the follower gets
the index back, it waits for it to be applied, and that completes the
read_barrier invocation.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is what a follower uses
to ask the leader for a safe index it can read. The first two are used by
the leader to communicate with a quorum.
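The flow described above can be sketched synchronously as follows. This is a toy model under stated assumptions, not the Raft implementation in this series: `leader_sketch` and `follower_sketch` are hypothetical, the quorum round is reduced to a count of acknowledgements, and waiting for the index to be applied is reduced to a comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

using index_t = std::uint64_t;

// Hypothetical leader: answers only if a quorum still acknowledges it.
struct leader_sketch {
    index_t commit_idx = 0;
    std::size_t cluster_size = 1;
    std::size_t acks = 1;  // servers (including self) confirming leadership

    std::optional<index_t> execute_read_barrier() const {
        assert(cluster_size > 0);
        if (acks * 2 <= cluster_size) {
            return std::nullopt;  // no quorum: leadership not confirmed
        }
        return commit_idx;
    }
};

// Hypothetical follower: the read is safe once the index is applied locally.
struct follower_sketch {
    index_t applied_idx = 0;

    bool read_barrier(const leader_sketch& leader) const {
        const auto idx = leader.execute_read_barrier();
        if (!idx) {
            return false;  // the caller would retry
        }
        // A real implementation waits asynchronously for this condition.
        return applied_idx >= *idx;
    }
};
```

The quorum check is what makes the returned index safe: a deposed leader cannot collect a majority, so it cannot hand out a stale commit index.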
Add a helper to be able to wait until a Raft cluster
leader is elected. It can be used to avoid sleeps
when it's necessary to forward a request to the leader,
but the leader is yet unknown.
"
Factor out replication test, make it work with different clocks, add
some features, and add a many nodes test with steady_clock. Also
refactor common test helper.
The many nodes test passes in release and dev modes with the normal tick
of 100ms for up to 1000 servers. In debug mode the limit is much lower due
to the lack of optimizations, so it's only tested with smaller numbers.
Tests: unit ({dev}), unit ({debug}), unit ({release})
"
* 'raft-many-22-v12' of https://github.com/alecco/scylla: (21 commits)
raft: candidate timeout proportional to cluster size
raft: testing: many nodes test
raft: replication test: remove unused tick_all
raft: replication test: delays
raft: replication test: packet drop rpc helper
raft: replication test: connectivity configuration
raft: replication test: rpc network map in raft_cluster
raft: replication test: use minimum granularity
raft: replication test: minor: rename local to int ids
raft: replication test: fix restart_tickers when partitioning
raft: replication test: partition ranges
raft: replication test: isolate one server
raft: replication test: move objects out of header
raft: replication test: make dummy command const
raft: replication test: template clock type
raft: replication test: tick delta inside raft_cluster
raft: replication test: style - member initializer
raft: replication test: move common code out
raft: testing: refactor helper
raft: log election stages
...
Now that the timeout is stored in the reader
permit, use it for admission rather than a timeout
parameter.
Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which is propagated to
maybe_wait_readmission, but it seems to be
an oversight of the f_m_r api that it doesn't
pass a timeout to next_partition().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The timeout needs to be propagated to the reader's permit.
Reset it to db::no_timeout in repair_reader::pause().
Warn if set_timeout asks to move the timeout too far into the
past (100ms). It is possible that it will be passed a
past timeout from the rpc path, where the message timeout
is applied (as a duration) over the local lowres_clock time,
and parallel read_data messages that share the query may end
up having close, but different timeout values.
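A minimal sketch of such a check, assuming that "too far into the past" is measured against the timeout currently stored in the permit. The function name, the use of steady_clock in place of lowres_clock, and the handling of "no timeout" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;
using time_point = std::chrono::steady_clock::time_point;

constexpr time_point no_timeout = time_point::max();
constexpr auto max_backwards_step = 100ms;  // threshold named in the text

// True when `requested` is more than 100ms earlier than `current`,
// i.e. the case in which a warning would be logged.
bool moves_too_far_into_past(time_point current, time_point requested) {
    if (current == no_timeout || requested >= current) {
        return false;  // setting the first timeout or extending it is fine
    }
    return current - requested > max_backwards_step;
}
```

With this threshold, two read_data messages whose deadlines differ by a few milliseconds would not trigger the warning, while a jump of 150ms backwards would.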
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.
Debug mode is too slow for a test of 1000 nodes so it's disabled, but
the test passes for release and dev modes.
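One way to make the randomized candidate timeout scale with the cluster is sketched below. The formula, the base value of 10 ticks and the function name are assumptions for illustration only; the commit text does not spell out the exact computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>

// Assumed base election timeout, in ticks.
constexpr std::size_t election_timeout_ticks = 10;

// Base timeout plus a random part whose range grows with the cluster
// size, so that candidates in large clusters are spread further apart.
template <typename Rng>
std::size_t randomized_election_timeout(std::size_t cluster_size, Rng& rng) {
    assert(cluster_size > 0);
    const std::size_t spread = std::max(election_timeout_ticks, cluster_size);
    std::uniform_int_distribution<std::size_t> dist(1, spread);
    return election_timeout_ticks + dist(rng);
}
```

For small clusters the range stays at the base value, so behaviour there is unchanged; only clusters larger than the base timeout get a wider spread.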
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Tests with many nodes and realistic timers and ticks.
Network delays are kept as a fraction of ticks. (e.g. 20/100)
Tests with 600 or more nodes hang in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Allow test supplied delays for rpc communication.
Allow supplying network delay, local delay (nodes within the same
server), how many nodes are local, and an extra small delay simulating
local load.
Modify rpc class to support delays. If delays are enabled, it no longer
directly calls the other node's server code but it schedules it to be
called later. This makes the test more realistic as in the previous
version the first candidate was always going to get to all followers
first, preventing a dueling candidates scenario.
Previously, tickers were all scheduled at the same time, so there was no
spread of them across the tick time. Now these tickers are scheduled
with a uniform spread across this time (tick delta).
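The spread can be sketched as below, assuming evenly spaced start offsets within one tick delta (a random uniform draw would be another reading of "uniform spread"). `ticker_offsets` is a hypothetical helper, not a function from the test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using duration = std::chrono::microseconds;

// Start offset for each server's ticker: server i starts at
// i * tick_delta / servers, so ticks are spread over one tick period.
std::vector<duration> ticker_offsets(std::size_t servers, duration tick_delta) {
    assert(servers > 0);
    std::vector<duration> offsets;
    offsets.reserve(servers);
    const auto n = static_cast<duration::rep>(servers);
    for (std::size_t i = 0; i < servers; ++i) {
        offsets.push_back(tick_delta * static_cast<duration::rep>(i) / n);
    }
    return offsets;
}
```

For example, four servers with a 100ms tick delta would start their tickers at 0, 25, 50 and 75ms.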
Also, previously, custom free elections used tick_all(), which
traversed _in_configuration sequentially and ticked each server. This,
combined with rpc outbound directly calling methods in the other server
without yielding, made free elections unrealistic, with the same order
determined and the first candidate always winning. This patch changes this
behavior. The free election uses normal tickers (now uniformly
distributed in tick delay time) and its loop waits for tick delay time
(yielding) and checks if there's a new leader. Also note the order might
not be the same in debug mode if more than one tick is scheduled.
As rpc messages are sent delayed, network connectivity needs to be
checked again before calling the function on the remote side.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When partitioning, elect_new_leader restarts tickers, so don't
re-restart them in this case.
When leader is dropped and no new leader is specified, restart tickers
before free election.
If no change of leader, restart tickers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>