There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.
The keep alive timer is used to close the session in the following case:
n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2
We need this because we do not kill the session when gossip think a node
is down, because we think the node down might only be temporary
and it is a waste to drop the previous work that has done especially
when the stream session takes long time.
Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data per time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream session with the node that
gossip think it is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
(cherry picked from commit b8158dd65d)
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.
The design constraints are:
1) The streamed writes should be visible to new writes, but the
sstable should not participate in compaction, or we would lose the
ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.
We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).
Fixes#3275
* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
tests: add materialized views test
tests: add view update generator to cql test env
main: add registering staging sstables read from disk
database: add a check if loaded sstable is already staging
database: add get_staging_sstable method
streaming: stream tables with views through staging sstables
streaming: add system distributed keyspace ref to streaming
streaming: add view update generator reference to streaming
main: add generating missed mv updates from staging sstables
storage_service: move initializing sys_dist_ks before bootstrap
db/view: add view_update_from_staging_generator service
db/view: add view updating consumer
table: add stream_view_replica_updates
table: split push_view_replica_updates
table: add as_mutation_source_excluding
table: move push_view_replica_updates to table.cc
database: add populating tables with staging sstables
database: add creating /staging directory for sstables
database: add sstable-excluding reader
table: add move_sstable_from_staging_in_thread function
...
(cherry picked from commit a38f6078fb)
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.
Fixes#3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
(cherry picked from commit 7f826d3343)
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream session.
Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.
In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes#3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
In the case there are large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N, N is
the number of in-flight column families, memory for memtable. Large N
causes a lot of small sstables to be generated.
It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster, the maximum in-flight column families
is number of peer node. The column families are sent in the order of
cf_id. It is not guaranteed that all peers has the same speed so they
are sending the same cf_id at the same time, though. We still have
chance some of the peers are sending the same cf_id.
Fixes#3065
Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
It is the handler for the failed complete message. Add a flag to
remember if we received a such message from peer, if so, do not send
back the failed complete message back to the peer when running
close_session with failed status.
Currently, send_complete_message is not used. We will use it shortly in
case the local session is failed. Send a complete message with failed
flag to notify peer node that the session is failed so that peer can
close the session. This can speed up the closing of failed session.
Also rename it to send_failed_complete_message.
Streaming ususally takes long time to complete. Abort it on false
positive idle detection can be very wasteful.
Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. The real idle session will be aborted eventually if other
mechanisms, e.g., streaming manager has gossip callback for on_remove
and on_restart event to abort, do not abort the session.
Fixes#2197
Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.
If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
If the peer node of a stream_session is restarted or removed we should
abort the streaming. It is better to hook gossip callback in the stream
manager than in each streamm_session.
When the first time the keep alive timer fires, the _last_stream_bytes
btyes will be zero since it is the first time we update it. The keep
alive timer will be rearmed and fired again. The second time, we find
there is no progress, we close the session. The total idle time will be
2 * keep alive timer.
To make the idle time to close the session be more precise, we reduce
the interval to check the progess and close the session by checking last
time the progress is made.
Message-Id: <c959cffce0cc738a3d73caaf71d2adb709d46863.1456831616.git.asias@scylladb.com>
For each stream_session, we pretend we are sending/receiving one file,
to make it compatible with nodetool. For receiving_files, the file name
is "rxnofile". For sending_files, the file name is "txnofile".
stream_manager::update_all_progress_info is introduced to update the
progress info of all the stream_sessions in the node. We need this
because streaming mutations are received on all the cores, but the
stream_session object is only on one of the cores. It adds overhead if
we update progress info in stream_session object whenever we receive a
streaming mutation. So, what we do now is when we really need the
progress info, we update the progress info in stream_session object.
With http://127.0.0.$i:10000/stream_manager/, it looks like below when
decommission node 3 in a 3 nodes cluster.
=========== GET NODE 1
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.3", "current_bytes": 16876296}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 2
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16755552, "peer": "127.0.0.3", "current_bytes": 16755552}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 3
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"sending_files": [{"value": {"direction":
"OUT", "file_name": "txnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.1", "current_bytes": 16876296}, "key":
"txnofile"}], "sending_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.1", "peer":
"127.0.0.1"},{"sending_files": [{"value": {"direction": "OUT",
"file_name": "txnofile", "session_index": 0, "total_bytes": 16755552,
"peer": "127.0.0.2", "current_bytes": 16755552}, "key": "txnofile"}],
"sending_summaries": [{"files": 1, "total_size": 0, "cf_id":
"869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0, "state":
"PREPARING", "connecting": "127.0.0.2", "peer": "127.0.0.2"}]}]
Currently, only the shard where the stream_plan is created on will send
streaing mutations. To utilize all the available cores, we can make each
shard send mutations which it is responsbile for. On the receiver side,
we do not forward the mutations to the shard where the stream_session is
created, so that we can avoid unnecessary forwarding.
Note: the downside is that it is now harder to:
1) to track number of bytes sent and received
2) to update the keep alive timer upon receive of the STREAM_MUTATION
To fix, we now store the sent/recieved bytes info on all shards. When
the keep alive timer expires, we check if any progress has been made.
Hopefully, this patch will make the streaming much faster and in turn
make the repair/decommission/adding a node faster.
Refs: https://github.com/scylladb/scylla/issues/849
Tested with decommission/repair dtest.
Message-Id: <96b419ab11b736a297edd54a0b455ffdc2511ac5.1454645370.git.asias@scylladb.com>
It is 1:1 mapping between session_info and stream_session. Putting
session_info inside stream_session, we can get rid of the
stream_coordinator::host_streaming_data class.
There are only two messages: prepare_message and outgoing_file_message.
Actually only the prepare_message is the message we send on wire.
Flatten the namespace.
- int connections_per_host
Scylla does not create connections per stream_session, instead it uses
rpc, thus connections_per_host is not relevant to scylla.
- bool keep_ss_table_level
- int repaired_at
Scylla does not stream sstable files. They are not relevant to scylla.
messaging_service will use private ip address automatically to connect a
peer node if possible. There is no need for the upper level like
streaming to worry about it. Drop it simplifies things a bit.
"When a node gain or regain responsibility for certain token ranges, streaming
will be performed, upon receiving of the stream data, the row cache
is invalidated for that range.
Refs #484."
The problem is that we set the session state to WAIT_COMPLETE in
send_complete_message's continuation, the peer node might send
COMPLETE_MESSAGE before we run the continuation, thus we set the wrong
status in COMPLETE_MESSAGE's handler and will not close the session.
Before:
GOT STREAM_MUTATION_DONE
receive task_completed
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: PREPARING -> WAIT_COMPLETE
GOT COMPLETE_MESSAGE Reply
maybe_completed: WAIT_COMPLETE -> WAIT_COMPLETE
After:
GOT STREAM_MUTATION_DONE
receive task_completed
maybe_completed: PREPARING -> WAIT_COMPLETE
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: WAIT_COMPLETE -> COMPLETE
Session with 127.0.0.2 is complete
If the session is idle for 10 minutes, close the session. This can
detect the following hangs:
1) if the sending node is gone, the receiving peer will wait forever
2) if the node which should send COMPLETE_MESSAGE to the peer node is
gone, the peer node will wait forever
Fixes simple_kill_streaming_node_while_bootstrapping_test.
When a node gain or regain responsibility for certain token ranges,
streaming will be performed, upon receiving of the stream data, the
row cache is invalidated for that range.
Refs #484.