We saw a failure in the dtest
concurrent_schema_changes_test.py: TestConcurrentSchemaChanges.changes_while_node_down_test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> throws ConnectionShutdown exception
The problem is that, after receiving the DOWN event, the python
Cassandra driver calls Cluster:on_down, which checks whether this client
has any connections to the node being shut down. If there are any
connections, the Cluster:on_down handler exits early, so the session
to the node being shut down is not removed.
If we shut down the cql server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
"
=== How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges into sub-ranges, each containing around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the same 100 partitions.
- If the checksums match, the data in this sub-range is synced.
- If the checksums mismatch, the repair master fetches the data from all the peers
and sends the merged data back to the peers.
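Below is a minimal, self-contained C++ sketch of the flow above. All helper names
(split_into_subranges, local_checksum, peer_checksum, fetch_from_peers,
send_merged_to_peers) and types are made up for illustration and are not the actual
repair code:

#include <cstdint>
#include <vector>

struct token_range {};        // placeholder for a token range
struct partition_set {};      // placeholder for the data of ~100 partitions

// Hypothetical stand-ins for the real repair/streaming code.
std::vector<token_range> split_into_subranges(const token_range&, unsigned partitions_per_subrange) { return {}; }
uint64_t local_checksum(const token_range&) { return 0; }
uint64_t peer_checksum(const token_range&, int peer) { return 0; }
partition_set fetch_from_peers(const token_range&) { return {}; }
void send_merged_to_peers(const token_range&, const partition_set&) {}

void repair_range_partition_level(const token_range& r, const std::vector<int>& peers) {
    // Split the range into sub-ranges of roughly 100 partitions each.
    for (const auto& sub : split_into_subranges(r, 100)) {
        uint64_t mine = local_checksum(sub);
        bool synced = true;
        for (int peer : peers) {
            if (peer_checksum(sub, peer) != mine) {
                synced = false;
                break;
            }
        }
        if (!synced) {
            // One mismatching row anywhere in the sub-range forces all ~100
            // partitions to be read again and transferred.
            send_merged_to_peers(sub, fetch_from_peers(sub));
        }
    }
}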
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes all 100
partitions to be transferred. A single partition can be very large, not to
mention the size of 100 partitions.
- Checksumming (finding the mismatch) and streaming (fixing the mismatch) read the
same data twice.
=== Row level repair
Row level checksum and synchronization: detect row-level mismatches and transfer
only the mismatched rows.
=== How the row level repair works
- To solve the problem of reading data twice:
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few megabytes of rows.
We read all the rows within the small range into memory, find the
mismatches and send the mismatched rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data:
We need to find the mismatched rows between nodes and send only the delta.
This is called the set reconciliation problem, which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = {row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1) and fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2.
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
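The example above can be reproduced with a small, self-contained C++ sketch built
on std::set; the row names are plain strings used purely for illustration:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

using row_set = std::set<std::string>;

// Rows present in `a` but missing from `b`.
row_set missing_in(const row_set& a, const row_set& b) {
    row_set diff;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(diff, diff.begin()));
    return diff;
}

int main() {
    row_set set1 = {"row1", "row2", "row3"};   // repair master (Node1)
    row_set set2 = {"row2", "row3"};           // Node2
    row_set set3 = {"row1", "row2", "row4"};   // Node3

    // Node1 fetches what it is missing from each peer.
    for (const auto& r : missing_in(set2, set1)) set1.insert(r);   // nothing
    for (const auto& r : missing_in(set3, set1)) set1.insert(r);   // row4
    // set1 is now the union {row1, row2, row3, row4}.

    // Node1 sends each peer the rows it lacks.
    for (const auto& r : missing_in(set1, set2)) std::cout << "to Node2: " << r << '\n';  // row1, row4
    for (const auto& r : missing_in(set1, set3)) std::cout << "to Node3: " << r << '\n';  // row3
}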
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position;
};
Read rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary (see the boundary-negotiation sketch after this list).
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly which rows are missing and requests
the missing rows from the peer nodes.
Now the local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since the local node also knows which rows the peer nodes own, it sends each peer
the rows it is missing.
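The boundary negotiation of Step A can be sketched as follows; the types below are
simplified stand-ins (a boundary is reduced to a pair of integers) and the function
names are illustrative, not the actual repair_reader code:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <tuple>
#include <vector>

// Simplified stand-in for (dht::decorated_key, position_in_partition).
struct repair_sync_boundary {
    uint64_t pk_token;
    uint64_t position;
    bool operator<(const repair_sync_boundary& o) const {
        return std::tie(pk_token, position) < std::tie(o.pk_token, o.position);
    }
};

struct row {
    repair_sync_boundary boundary;
    size_t size_bytes;
};

// Each node reads rows (already sorted) into a buffer until ~N bytes are
// buffered and proposes the boundary of the last row it read.
std::optional<repair_sync_boundary>
fill_buffer(const std::vector<row>& rows_on_disk, size_t n_bytes, std::vector<row>& buf) {
    size_t total = 0;
    for (const auto& r : rows_on_disk) {
        buf.push_back(r);
        total += r.size_bytes;
        if (total > n_bytes) {
            break;
        }
    }
    if (buf.empty()) {
        return std::nullopt;
    }
    return buf.back().boundary;
}

// The repair master picks the smallest proposal as current_sync_boundary, so
// every node has already read all rows up to the agreed boundary.
// Assumes at least one proposal (the master always proposes one itself).
repair_sync_boundary
negotiate_sync_boundary(const std::vector<repair_sync_boundary>& proposals) {
    return *std::min_element(proposals.begin(), proposals.end());
}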
=== What the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
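A rough, self-contained sketch of how the repair master might drive these verbs for
one range; the stub functions below merely mirror the names above (the real ones are
RPCs), and the control flow is a simplification of the actual repair loop:

#include <vector>

struct peer {};                   // placeholder handle for a peer node
struct sync_boundary {};          // placeholder for repair_sync_boundary
struct combined_hash {
    bool operator==(const combined_hash&) const { return true; }
};

void repair_range_start(const std::vector<peer>&) {}
// Returns false once the range is exhausted and there is no next boundary.
bool request_sync_boundary(const std::vector<peer>&, sync_boundary&) { return false; }
combined_hash local_combined_row_hashes() { return {}; }
combined_hash request_combined_row_hashes(const peer&) { return {}; }
void request_full_row_hashes(const peer&) {}   // learn exactly which rows differ
void request_row_diff(const peer&) {}          // pull rows the master is missing
void send_row_diff(const peer&) {}             // push rows the peer is missing
void repair_range_stop(const std::vector<peer>&) {}

void repair_one_range(const std::vector<peer>& peers) {
    repair_range_start(peers);
    sync_boundary current;
    while (request_sync_boundary(peers, current)) {             // Step A
        bool all_match = true;
        auto mine = local_combined_row_hashes();
        for (const auto& p : peers) {
            if (!(request_combined_row_hashes(p) == mine)) {
                all_match = false;
                break;
            }
        }
        if (all_match) {
            continue;   // data up to this boundary is in sync
        }
        for (const auto& p : peers) {                           // Step B
            request_full_row_hashes(p);
            request_row_diff(p);
        }
        for (const auto& p : peers) {                           // Step C
            send_row_diff(p);
        }
    }
    repair_range_stop(peers);
}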
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows into each of the 3 nodes. Each node has 241 GiB of data.
We tested the 3 cases below.
1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx = 162 GiB, rx = 90 GiB
new: tx = 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, the repair master needs to receive 2 million rows and
send 4 million rows. Here are the details: each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So the repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
The repair master sends its own 1 million rows plus the 1 million rows
received from repair slave 1 to repair slave 2, and sends its own 1 million
rows plus the 1 million rows received from repair slave 2 to repair slave 1.
As a result, we saw that the rows on the wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
When delaying a base write, there is no need to hold on to the
mutation if all replicas have already replied.
We introduce mutation_holder::release_mutation(), which frees the
mutations that are no longer needed during the rest of the delay.
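A minimal sketch of the idea with made-up types (this is not the actual
mutation_holder interface): once every replica has replied, the handler keeps
delaying the client's reply but no longer needs to keep the mutation in memory.

#include <cstddef>
#include <memory>

struct frozen_mutation {};   // placeholder for the real mutation type

class mutation_holder {
    std::unique_ptr<frozen_mutation> _mutation = std::make_unique<frozen_mutation>();
public:
    // Called once all replicas have replied: the mutation is no longer needed
    // for the rest of the delay, so free its memory early.
    void release_mutation() {
        _mutation.reset();
    }
};

class write_response_handler {
    mutation_holder _holder;
    size_t _pending_replies;
public:
    explicit write_response_handler(size_t replicas) : _pending_replies(replicas) {}
    void on_replica_reply() {
        if (--_pending_replies == 0) {
            _holder.release_mutation();   // keep delaying the reply, but drop the payload
        }
    }
};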
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica. We use the base’s backlog of view updates to derive
this delay.
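The commit does not spell out the exact mapping from backlog to delay; a purely
hypothetical sketch of the shape of such a mapping could look like this (the linear
form and the 100ms cap are assumptions for illustration only):

#include <algorithm>
#include <chrono>

// A replica's view-update backlog as a fraction of the maximum backlog it is
// willing to accumulate (0.0 = empty, 1.0 = full).
struct view_update_backlog {
    double relative_size;
};

// Hypothetical mapping from the worst backlog among the contacted replicas to
// an artificial delay of the client's reply.
std::chrono::microseconds delay_for(const view_update_backlog& worst) {
    using namespace std::chrono;
    constexpr auto max_delay = 100ms;   // illustrative cap, not the real constant
    auto scaled = max_delay * std::clamp(worst.relative_size, 0.0, 1.0);
    return duration_cast<microseconds>(scaled);
}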
If we achieve CL and the backlogs of all replicas involved were last
seen to be empty, then we wouldn't delay the client's reply. However,
it could be that one of the replicas is actually overloaded and won't
reply for many such new requests. We'll eventually start applying
backpressure to the client via the background write queue, but in
the meantime we may be dropping view updates. To mitigate this we rely
on the backlog being gossiped periodically.
Fixes #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_update_backlog_broker class, which is
responsible for periodically updating the local gossip state with the
current node's view update backlog. It also registers to updates from
other nodes, and updates the local coordinator's view of their view
update backlogs.
We consider the view update backlog received from a peer through the
mutation_done verb to be always fresh, but we consider the one received
through gossip to be fresh only if it has a higher timestamp than what
we currently have recorded.
This is because a node only updates its gossip state periodically, and
also because a node can transitively receive gossip state about a third
node with outdated information.
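A small sketch of that freshness rule with illustrative types (the real code keeps
this state per endpoint in the coordinator):

#include <cstdint>
#include <unordered_map>

struct view_update_backlog {
    uint64_t current = 0;
    uint64_t max = 0;
};

struct known_backlog {
    view_update_backlog backlog;
    int64_t timestamp = 0;   // when the owning node sampled it
};

class backlog_tracker {
    std::unordered_map<uint32_t /* endpoint id */, known_backlog> _backlogs;
public:
    // A backlog piggy-backed on a mutation_done reply is always considered
    // fresh, so take it unconditionally.
    void on_mutation_done(uint32_t endpoint, view_update_backlog b, int64_t ts) {
        _backlogs[endpoint] = {b, ts};
    }

    // A gossiped backlog may be stale (nodes gossip only periodically, and the
    // state can travel through a third node), so accept it only if it carries
    // a higher timestamp than what we currently have recorded.
    void on_gossip(uint32_t endpoint, view_update_backlog b, int64_t ts) {
        known_backlog& known = _backlogs[endpoint];
        if (ts > known.timestamp) {
            known = {b, ts};
        }
    }
};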
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In subsequent patches, replicas will reply to the coordinator with
their view update backlog. Before introducing changes to the
messaging_service, prepare the storage_proxy to receive and store
those backlogs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The local view update backlog is the max backlog out of the relative
memory backlog size and the relative hints backlog size.
We leverage the db::view::node_update_backlog class so we can send the
max backlog out of the node's shards.
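A sketch of that computation with plain types standing in for the real sharded
machinery in db::view::node_update_backlog (the field names are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

struct update_backlog {
    uint64_t current = 0;
    uint64_t max = 0;
    // Backlogs are compared by relative size, so a small shard that is nearly
    // full ranks above a large shard that is nearly empty.
    double relative() const { return max ? double(current) / double(max) : 0.0; }
};

// Per-shard local backlog: the larger of the relative memory backlog and the
// relative hints backlog, i.e. whichever resource is closer to exhaustion.
update_backlog local_backlog(update_backlog memory, update_backlog hints) {
    return memory.relative() >= hints.relative() ? memory : hints;
}

// Node-wide backlog published to coordinators: the max across all shards.
// Assumes at least one shard.
update_backlog node_backlog(const std::vector<update_backlog>& per_shard) {
    return *std::max_element(per_shard.begin(), per_shard.end(),
        [] (const update_backlog& a, const update_backlog& b) {
            return a.relative() < b.relative();
        });
}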
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.
This value should be increased further, but that change depends on
Refs #2538
Refs #3826
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Working on database.hh, or any header that is included in database.hh
(of which there are a lot), is a major pain, as each change involves the
recompilation of half of our compilation units.
Reduce the impact by removing the `#include "database.hh"` directive
from as many header files as possible. Many headers can make do with
just some forward declarations and don't need to include the entire
headers. I also found some headers that included database.hh without
actually needing it.
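An illustrative before/after of the technique (the file and class names are made
up): a header that only passes database around by reference can forward-declare it
and push the heavy include down into its .cc file.

// some_service.hh -- before: pulls in all of database.hh for every includer.
// #include "database.hh"

// some_service.hh -- after: a forward declaration is enough for references,
// pointers and function signatures; only the .cc needs the full definition.
class database;

class some_service {
    database& _db;
public:
    explicit some_service(database& db) : _db(db) {}
    void do_work();
};

// some_service.cc -- the implementation still includes database.hh:
// #include "database.hh"
// void some_service::do_work() { /* uses _db's members here */ }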
Results
Before:
$ touch database.hh
$ ninja build/release/scylla
[1/154] CXX build/release/gen/cql3/CqlParser.o
After:
$ touch database.hh
$ ninja build/release/scylla
[1/107] CXX build/release/gen/cql3/CqlParser.o
"
* 'reduce-dependencies-on-database-hh/v2' of https://github.com/denesb/scylla:
treewide: remove include database.hh from headers where possible
database_fwd.hh: add keyspace fwd declaration
service/client_state: de-inline set_keyspace()
Move cache_temperature into its own header
Many headers don't really need to include database.hh; the include can
be replaced by forward declarations and/or by directly including the
headers that are actually needed. Some headers don't need this include at all.
Each header was verified to be compilable on its own after the change,
by including it into an empty `.cc` file and compiling it. `.cc` files
that used to get `database.hh` through headers that no longer include it
were changed to include it themselves.
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.
It will also make the timeout easily accessible inside the handler,
for future patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
This is a backport of CASSANDRA-11038.
Before this, a restarted node would be reported as a new node with a NEW_NODE
cql notification.
To fix this, only send the NEW_NODE notification when the node was not already
part of the cluster.
Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.
However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we avoid some schema
mismatches.
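A sketch of the idea with illustrative types (the table name and the zero timestamp
are placeholders, not the real values): deriving the schema mutations from a fixed,
shared timestamp instead of the local clock makes the mutations identical on every
node, so the concurrent creations converge.

#include <cstdint>
#include <string>
#include <vector>

using api_timestamp = int64_t;

struct schema_mutation {
    std::string table_name;
    api_timestamp timestamp;   // identical inputs produce identical mutations
};

// Before: every node stamped the mutations with its own clock, so concurrently
// created tables diverged and needed a schema pull to agree.
// After: one well-known timestamp shared by all nodes.
constexpr api_timestamp distributed_keyspace_timestamp = 0;

std::vector<schema_mutation> make_distributed_keyspace_mutations() {
    return {
        {"system_distributed.some_table", distributed_keyspace_timestamp},
    };
}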
This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.
Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.
Fixes #3976
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
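A sketch of the pattern, using batchlog_manager (from the commit list below) as the
example; the struct fields and the translation shown in the trailing comment are
illustrative, not the actual option names:

// batchlog_manager.hh -- no #include "db/config.hh" needed any more.
#include <chrono>
#include <cstddef>
#include <utility>

// The module declares a small struct with just the settings it cares about.
struct batchlog_manager_config {
    std::chrono::milliseconds replay_interval{0};
    size_t replay_rate_bytes_per_second = 0;
};

class batchlog_manager {
    batchlog_manager_config _cfg;
public:
    explicit batchlog_manager(batchlog_manager_config cfg) : _cfg(std::move(cfg)) {}
};

// The caller (which already depends on db::config) fills the struct, so only
// that one compilation unit sees the global config class, e.g.:
//
//   batchlog_manager_config cfg;
//   cfg.replay_rate_bytes_per_second = dbcfg.batchlog_replay_throttle_in_kb() * 1024;
//   batchlog_manager bm(std::move(cfg));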
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
At this point the cql_ready facility is ready. To use it, advertise the
RPC_READY application state in the following cases:
- When a node boots, set it to false
- When cql server is ready, set it to true
- When cql server is down, set it to false
auth::service already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
permissions_cache already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
The rh_entry address is captured inside the timeout's callback lambda, so the
structure must not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.
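A condensed illustration of the bug and the fix, using std::unordered_map and a
plain callback as a stand-in for the timer (the real code deals with rh_entry and
its expiry timer):

#include <functional>
#include <unordered_map>

struct rh_entry_like {
    std::function<void()> on_timeout;   // armed with a lambda capturing the entry's address
};

std::unordered_map<int, rh_entry_like> handlers;

// Buggy pattern: the callback captures the address of a local object that is
// subsequently *moved* into the map, so the captured pointer dangles.
void add_handler_buggy(int id) {
    rh_entry_like e;
    e.on_timeout = [p = &e] { /* p dangles after the move below */ };
    handlers.emplace(id, std::move(e));
}

// Fixed pattern: construct the entry in place inside the map first, then arm
// the callback with the address of the element that will actually live on
// (unordered_map keeps element addresses stable).
void add_handler_fixed(int id) {
    auto [it, inserted] = handlers.emplace(id, rh_entry_like{});
    rh_entry_like& e = it->second;
    e.on_timeout = [p = &e] { /* p keeps pointing at the map element */ };
}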
Fixes #3972.
Message-Id: <20181206164043.GN25283@scylladb.com>
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.
The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.
Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads from it.
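A bare-bones sketch of what such a feature_service might look like; std::function
callbacks stand in for the future/promise machinery the real code uses:

#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Owns feature state independently of both gossip and storage_service, so
// waiters registered before gossip starts are never lost.
class feature_service {
    struct feature_state {
        bool enabled = false;
        std::vector<std::function<void()>> waiters;
    };
    std::unordered_map<std::string, feature_state> _features;
public:
    // storage_service (or anyone else) can register interest at any time,
    // even before gossip has been initialized.
    void when_enabled(const std::string& name, std::function<void()> f) {
        auto& st = _features[name];
        if (st.enabled) {
            f();
        } else {
            st.waiters.push_back(std::move(f));
        }
    }

    // The gossiper calls this once the whole cluster advertises the feature.
    void enable(const std::string& name) {
        auto& st = _features[name];
        if (st.enabled) {
            return;
        }
        st.enabled = true;
        for (auto& w : st.waiters) {
            w();
        }
        st.waiters.clear();
    }
};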
Fixes #3953.
* https://github.com/avikivity/3953/v4.1:
storage_service: deinline enable_all_features()
gossiper: keep features registered
tests/gossip: switch to seastar::thread
storage_service: deinline init/deinit functions
gossiper: split feature storage into a new feature_service
gossiper: maybe enable features after start_gossiping()
storage_service: fix gap when feature::when_enabled() doesn't work
storage_service::register_features() reassigns to feature variables in
storage_service. This means that any call to feature::when_enabled() will be
orphaned when the feature is assigned.
Now that feature lifetimes are not tied to gossip, we can move the feature
initialization to the constructor and eliminate the gap. When gossip is started
it will evaluate application_states and enable features that the cluster agrees on.
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untying features from gossip.
drain suffers from the same problem as startup does now: memtables
are flushed as part of the drain routine, and because there are no
incoming writes, the shares the controller assigns to flushes go down over
time, slowing down the drain process.
This patch reorders things so that we stop compactions first and flush
later. It guarantees that when the flush does happen it will have the full
bandwidth to work with.
There is a comment in the code saying we should stop compactions
forcefully instead of waiting for them to finish. I consider this
orthogonal to this patch, therefore I am not touching it. Doing so would
make the drain operation even faster, but that can be done later. Even when we
do it, having the flushes proceed alone instead of during compactions
will make them faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>