"
This series adds a fuzzer-type unit test for range scans, which
generates a semi-random dataset and executes semi-random range scans
against it, validating the result.
This test aims to cover a wide range of corner cases with the help of
randomness. Data and queries against it are generated in such a way that
various corner cases and their combinations are likely to be covered.
The infrastructure under range-scans have gone under massive changes in
the last year, growing in complexity and scope. The correctness of range
scans is critical for the correct functioning of any Scylla cluster, and
while the current unit tests served well in detecting any major problems
(mostly while developing), they are too simplistic and can only be
relied on to check the correctness of the basic functionality. This test
aims to extend coverage drastically, testing cases that the author of
the range-scan code or that of the existing unit tests didn't even think
exists, by relying on some randomness.
Fixes: #3954 (deprecates really)
"
* 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla:
tests/multishard_mutation_query_test: add fuzzy test
tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan()
tests/test_table: add advanced `create_test_table()` overload
tests/test_table: make `create_test_table()` customizable
query: add trim_clustering_row_ranges_to()
tests/test_table: add keyspace and table name params
tests/test_table: s/create_test_cf/create_test_table/
tests: move create_test_cf() to tests/test_table.{hh,cc}
tests/multishard_mutation_query_test: drop many partition test
tests/multishard_mutation_query_test: drop range tombstone test
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.
Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.
Fixes#4143.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
The shard-aware drivers can cause a huge amount of connections to be created
when there are tens of thousands of clients. While normally the shard-aware
drivers are beneficial, in those cases they can consume too much memory.
Provide an option to disable shard awareness from the server (it is likely to
be easier to do this on the server than to reprovision those thousands of
clients).
Tests: manual test with wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right fater n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
The theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.
In commit da80f27f44, we fixed the problem by
checking the pointer before dereference.
To prevent this to happen in the first place, we'd better to add
application_state::SCHEMA when gossip starts. This way, peer nodes
always see the application_state::SCHEMA when a node restarts.
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Fixes#4148Fixes#4258
Limits are stored as uint32_t everywhere, but in some places
int32_t was used, which created inconsistencies when comparing
the value to std::numeric_limits<Type>::max().
In order to solve inconsistencies, the types are unified to uint32_t,
and instead of explicitly calling numeric limit max,
an already existing constant value query::max_rows is utilized.
Fixes#4253
Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>
"
This series introduces PER PARTITION LIMIT to CQL.
Protocol and storage is already capable of applying per-partition limits,
so for nonpaged queries the changes are superficial - a variable is parsed
and passed down.
For paged queries and filtering the situation is a little bit more complicated
due to corner cases: results for one partition can be split over 2 or more pages,
filtering may drop rows, etc. To solve these, another variable is added to paging
state - the number of rows already returned from last served partition.
Note that "last" partition may be stretched over any number of pages, not just the
last one, which is a case especially when considering filtering.
As a result, per-partition-limiting queries are not eligible for page generator
optimization, because they may need to have their results locally filtered
for extraneous rows (e.g. when the next page asks for per-partition limit 5,
but we already received 4 rows from the last partition, so need just 1 more
from last partition key, but 5 from all next ones).
Tests: unit (dev)
Fixes#2202
"
* 'add_per_partition_limit_3' of https://github.com/psarna/scylla:
tests: remove superficial ignore_order from filtering tests
tests: add filtering with per partition key limit test
tests: publish extract_paging_state and count_rows_fetched
tests: fix order of parameters in with_rows_ignore_order
cql3,grammar: add PER PARTITION LIMIT
idl,service: add persistent last partition row count
cql3: prevent page generator usage for per-partition limit
cql3: add checking for previous partition count to filtering
pager: add adjusting per-partition row limit
cql3: obey per partition limit for filtering
cql3: clean up unneeded limit variables
cql3: obey per partition limit for select statement
cql3: add get_per_partition_limit
cql3: add per_partition_limit to CQL statement
In order to process paged queries with per-partition limits properly,
paging state needs to keep additional information: what was the row
count of last partition returned in previous run.
That's necessary because the end of previous page and the beginning
of current one might consist of rows with the same partition key
and we need to be able to trim the results to the number indicated
by per-partition limit.
Filtering now needs to take into account per partition limits as well,
and for that it's essential to be able to compare partition keys
and decide which rows should be dropped - if previous page(s) contained
rows with the same partition key, these need to be taken into
consideration too.
For filtering pagers, per partition limit should be set
to page size every time a query is executed, because some rows
may potentially get dropped from results.
Part of the code is already implemented (counters and hinted-handoff).
Part of the code will probably never be (triggers). And the rest is
the code that estimates number of rows per range to determine query
parallelism, but we implemented exponential growth algorithms instead.
Message-Id: <20190214112226.GE19055@scylladb.com>
"
get_restricted_ranges() is inefficient since it calculates all
vnodes that cover a requested key ranges in advance, but callers often
use only the first one. Replace the function with generator interface
that generates requested number of vnodes on demand.
"
* 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev:
storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator
storage_proxy: remove old get_restricted_ranges() interface
cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface
tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface
storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface
storage_proxy: introduce new query_ranges_to_vnode_generator interface
When bootstrapping, a node should to wait to have a schema agreement
with its peers, before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.
To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has to checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.
To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitely make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.
Fixes: #4196
Branches: 3.0, 2.3
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
Do not recalculate too much ranges in advance, it requires large
allocation and usually means that a consumer of the interface is going
to do to much work in parallel.
Fixes: #3767
get_restricted_ranges() function gets query provided key ranges
and divides them on vnode boundaries. It iterates over all ranges and
calculates all vnodes, but all its users are usually interested in only
one vnode since most likely it will be enough to populate a page. If it
will be not enough they will ask for more. This patch introduces new
interface instead of the function that allows to generate vnode ranges
on demand instead of precalculating all of them.
This algorithm was already duplicated in two places
(service/pager/query_pagers.cc and mutation_reader.cc). Soon it will be
used in a third place. Instead of triplicating, move it into a function
that everybody can use.
Fixes#4010
Unless user sets this explicitly, we should try explicitly avoid
deprecated protocol versions. While gnutls should do this for
connections initiated thusly, clients such as drivers etc might
use obsolete versions.
Message-Id: <20190107131513.30197-1-calle@scylladb.com>
Commit 976324bbb8 changed to use
get_application_state_ptr to get a pointer of the application_state. It
may return nullptr that is dereferenced unconditionally.
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right fater n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
We do not know when this backtrace happened, but according to log from n3 an n4:
INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the
delay the gossip notice a peer is down), so the abort time is around 04:56:0X.
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace
must be scheduled before n1 is restarted, because it dereference
application_state pointer after it sleeps 60 seconds, so the time
maybe_schedule_schema_pull is called is around 04:55:0X which is before n1 is
restarted.
So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.
Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
"
This series prevents view building to fall back to storing hints.
Instead, it will try to send hints to an endpoint as if it has
consistency level ONE, and in case of failure retry the whole
building step. Then, view building will never be marked as finished
prematurely (because of pending hints), which will help avoid
creating inconsistencies when decommissioning a node from the cluster.
Tests:
unit (release)
dtest (materialized_views_test.py.*)
Fixes#3857Fixes#4039
"
* 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla:
db,view: add updating view_building_paused statistics
database: add view_building_paused metrics
table: make populate_views not allow hints
db,view: add allow_hints parameter to mutate_MV
storage_proxy: add allow_hints parameter to send_to_endpoint
This commit declares shared_ptr<user_types_metadata> in
database.hh were user_types_metadata is an incomplete type so
it requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Scylla starts doing IO much earlier that it starts cql/thrift servers.
The IO may cause an error that will try stop all servers, but since they
are still not running it will do nothing, but servers will be started
later. Fix it by checking that the node is not isolated before starting
servers.
Message-Id: <20190110152830.GE3172@scylladb.com>
Previously the limit was erroneously applied per page
instead of being accumulated, which might have caused returning
too many rows. As of now, LIMIT is handled properly inside
restrictions filter.
Fixes#4100
SSTables loaded to the system via /upload dir may sometimes be needed
to generate view updates from them (if their table has accompanying
views).
Fixes#4047
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. But moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
When a node is bootstrapping, it will receive data from other nodes
via streaming, including materialized views. Regardless whether these
views are built on other nodes or not, building them on newly
bootstrapped nodes has no effect - updates were either already streamed
completely (if view building have finished) or will be propagated
via view building, if the process is still ongoing.
So, marking all views as 'built' for the bootstrapped node prevents it
from spawning superfluous view building processes.
Fixes#4001
Message-Id: <fd53692c38d944122d1b1013fdb0aedf517fa409.1546498861.git.sarna@scylladb.com>
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
We saw failure in dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception throws
The problem is that, after receiving the DOWN event, the python
Cassandra driver will call Cluster:on_down which checks if this client
has any connections to the node being shutdown. If there is any
connections, the Cluster:on_down handler will exit early, so the session
to the node being shutdown will not be removed.
If we shutdown the cql server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
"
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
When delaying a base write, there is no need to hold on to the
mutation if all replicas have already replied.
We introduce mutation_holder::release_mutation(), which frees the
mutations that are no longer needed during the rest of the delay.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica. We use the base’s backlog of view updates to derive
this delay.
If we achieve CL and the backlogs of all replicas involved were last
seen to be empty, then we wouldn't delay the client's reply. However,
it could be that one of the replicas is actually overloaded, and won't
reply for many new such requests. We'll eventually start applying
backpressure to the client via the background's write queue, but in
the meanwhile we may be dropping view updates. To mitigate this we rely
on the backlog being gossiped periodically.
Fixes#2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_update_backlog_broker class, which is
responsible for periodically updating the local gossip state with the
current node's view update backlog. It also registers to updates from
other nodes, and updates the local coordinator's view of their view
update backlogs.
We consider the view update backlog received from a peer through the
mutation_done verb to be always fresh, but we consider the one received
through gossip to be fresh only if it has a higher timestamp than what
we currently have recorded.
This is because a node only updates its gossip state periodically, and
also because a node can transitively receive gossip state about a third
node with outdated information.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>