Commit Graph

245 Commits

Paweł Dziepak
3e0555809e storage_proxy: catch all exceptions in read executor
abstract_read_executor::reconcile() is supposed to make sure that
_result_promise is eventually set to either a result or an exception.
That may not happen, however, if reconciliation throws any exception,
since only read timeouts are being caught. When that happens the
continuation chain becomes stuck.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-31 16:38:41 +01:00
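
A minimal standard-C++ sketch of the pattern this fix implies (the
names follow the commit text; the body is an illustration, not the
actual Scylla code):

    #include <exception>
    #include <future>

    struct result {};  // stand-in for the real query result type

    struct abstract_read_executor {
        std::promise<result> _result_promise;

        result do_reconcile() { return {}; }  // may throw anything

        void reconcile() {
            try {
                _result_promise.set_value(do_reconcile());
            } catch (...) {
                // Catch *all* exceptions, not only read timeouts, so
                // the promise is always set and the continuation chain
                // never gets stuck.
                _result_promise.set_exception(std::current_exception());
            }
        }
    };
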
Glauber Costa
5fa866223d streaming: add incoming streaming mutations to a different sstable
Keeping the mutations coming from the streaming process as mutations like any
other has a number of advantages - and that's why we do it.

However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients and those arriving from peers in the
streaming process.

As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.

To make that differentiation possible, we'll keep a separate set of memtables
that will contain only streaming mutations. We don't have to do it this way,
but doing so makes life a lot easier. In particular, to write an SSTable, our
API requires (because the filter requires it) that a good estimate of the
number of partitions be provided in advance. The partitions also need to be
sorted.

We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:13:00 -04:00
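
A rough sketch of the separation described above, with stand-in types
(the real code plumbs this through column_family and its flush path):

    #include <vector>

    struct memtable {};   // stand-in
    struct mutation {};   // stand-in

    class column_family {
        std::vector<memtable> _memtables;            // client writes
        std::vector<memtable> _streaming_memtables;  // streamed writes

    public:
        enum class origin { client, streaming };

        void apply(const mutation& m, origin o) {
            auto& set = (o == origin::streaming) ? _streaming_memtables
                                                 : _memtables;
            // Add m to the active memtable of the chosen set. Flushing
            // each set under a different I/O priority class is what
            // lets the scheduler tell client load and streaming apart.
            (void)set; (void)m;
        }
    };
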
Paweł Dziepak
9f3893980a move SCHEMA_CHECK registration to migration_manager
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
2016-03-15 12:24:37 +02:00
Asias He
883d8cb8fd storage_service: Move REPLICATION_FINISHED verb to storage_service
It belongs to storage_service, not storage_proxy.
2016-03-15 16:13:22 +08:00
Gleb Natapov
5076f4878b main: Defer storage proxy RPC verb registration after commitlog replay
Message-Id: <20160315071229.GM6117@scylladb.com>
2016-03-15 09:18:12 +02:00
Pekka Enberg
1429213b4c main: Defer migration manager RPC verb registration after commitlog replay
Defer registering migration manager RPC verbs until after the commitlog
has been replayed, so that our own schema is fully loaded before other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
2016-03-14 18:03:16 +01:00
Paweł Dziepak
82d2a2dccb specify whether query::result, result_digest or both are needed
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
46079f763b query: add keys and tombstones to result digest
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.

Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
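
An illustrative hasher (not Scylla's actual digest code) showing the
idea: clustering keys are always fed into the digest, whether or not
the client selected them, and tombstones are mixed in too:

    #include <cstdint>
    #include <string>

    struct digest_hasher {             // toy FNV-1a, illustration only
        uint64_t state = 1469598103934665603ULL;
        void feed(const std::string& bytes) {
            for (unsigned char c : bytes) {
                state ^= c;
                state *= 1099511628211ULL;
            }
        }
    };

    struct row {
        std::string clustering_key;
        std::string cells;
        bool is_tombstone = false;
    };

    void digest_row(digest_hasher& h, const row& r) {
        h.feed(r.clustering_key);  // included regardless of selection
        if (r.is_tombstone) {
            h.feed("\x01");        // tombstones expose stale replicas early
        }
        h.feed(r.cells);
    }
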
Paweł Dziepak
77dbe3c12f storage_proxy: fix reconciliation with limits
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge these mutations and send the result to the
client; the difference between the merged result and the mutation sent
by a particular replica is sent back to that replica to repair it.
Unfortunately, that may not suffice to provide the user with correct
results in case of disagreements.

Consider the following scenario:

create table cf(p int, c int, r int, primary key(p, c));

node1:
p=0, c=1, r=1 (timestamp = 1)
p=0, c=2, r=2 (timestamp = 2)

node2:
p=0, c=1, r=tombstone (timestamp = 2)
p=0, c=2, r=1 (timestamp = 1)

query:
select r from cf limit 1;

Let's assume there are no row markers. node1 will send only the
outdated cell (p=0, c=1, r=1) while node2 will send both the tombstone
for c=1 and the outdated cell (p=0, c=2, r=1). A disagreement will be
detected, the replies will be merged and the coordinator will respond
to the client with the result r=1, while the correct answer is r=2.

The solution proposed in this patch is to attempt to detect cases where
the problem may occur and retry the query with a larger limit, which
results in replicas providing more information.

The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of the replies from the
replicas (except short reads). If the (pk, ck) of a replica's last row
is smaller than the (pk, ck) of the reconciled result, the query is
retried.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:26:33 +00:00
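
The detection rule, sketched with a simplified ordered key (the real
code compares partition and clustering keys with the table's
comparators):

    #include <optional>
    #include <utility>

    using last_row_key = std::pair<int, int>;  // (pk, ck), simplified

    // reconciled: (pk, ck) of the last row of the merged result.
    // replica:    last row a replica sent, or nullopt for a short read
    //             (short reads are exempt from the check).
    bool needs_retry(const last_row_key& reconciled,
                     const std::optional<last_row_key>& replica) {
        if (!replica) {
            return false;
        }
        // A replica that stopped before the reconciled result did may
        // be withholding rows that would change the merge: retry the
        // query with a larger limit.
        return *replica < reconciled;
    }
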
Gleb Natapov
f242c6395c storage_proxy: add counter for retried reads
Message-Id: <20160309130453.GF2253@scylladb.com>
2016-03-09 14:09:42 +01:00
Gleb Natapov
ce6d1a242a storage_proxy: fix background_reads counter
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.

Message-Id: <20160307182255.GI4849@scylladb.com>
2016-03-07 19:41:09 +01:00
Gleb Natapov
2d092bbd32 storage_proxy: send read requests with timeout
No need to wait for replies long after the request has timed out.
Message-Id: <1457351304-28721-2-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:11 +01:00
Gleb Natapov
4122422d19 storage_proxy: always wait for digest read resolver done future
Currently it is waited upon only if a background read repair check is
needed, and this causes an unhandled exception warning to be printed if
it enters the failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if the check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:09 +01:00
Gleb Natapov
626c9d046b fix EACH_QUORUM handling during bootstrapping
Currently, write acknowledgement handling does not take the
bootstrapping node into account for CL=EACH_QUORUM. The patch fixes
that.

Fixes #994

Message-Id: <20160307121620.GR2253@scylladb.com>
2016-03-07 13:56:34 +01:00
Gleb Natapov
f59415b3c6 Take pending endpoints into account while checking for sufficient live nodes
During bootstrap, additional copies of data have to be made to ensure
that the CL is met (see CASSANDRA-833 for details). Our code does that,
but it does not take into account that the bootstrapping node can be
dead, which may cause a request to proceed even though there are not
enough live nodes for it to be completed. In such a case the request
neither completes nor times out, so it appears to be stuck from the CQL
layer's point of view. The patch fixes this by taking pending nodes
into account while checking that there are sufficient live nodes for
the operation to proceed.

Fixes #965

Message-Id: <20160303165250.GG2253@scylladb.com>
2016-03-07 13:30:13 +01:00
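
A sketch of the adjusted check (mirroring the Cassandra logic the
commit refers to; names and types are illustrative):

    #include <algorithm>
    #include <stdexcept>
    #include <vector>

    struct endpoint { bool alive; };

    void assure_sufficient_live_nodes(
            size_t block_for,
            const std::vector<endpoint>& natural,
            const std::vector<endpoint>& pending) {
        auto live = [](const endpoint& e) { return e.alive; };
        size_t live_count =
            std::count_if(natural.begin(), natural.end(), live) +
            std::count_if(pending.begin(), pending.end(), live);
        // Each pending endpoint must receive an extra copy, so it
        // raises the number of live nodes the operation needs; a dead
        // pending node must fail the request here instead of leaving
        // it stuck.
        if (live_count < block_for + pending.size()) {
            throw std::runtime_error("cannot achieve consistency level");
        }
    }
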
Gleb Natapov
b89b6f442b storage_proxy: fix race between read cl completion and timeout in digest resolver
If the timeout fires after the CL promise is fulfilled, but before the
continuation runs, it removes all the data that the CL continuation
needs to calculate the result. Fix this by calculating the result
immediately and returning it in the CL promise, instead of delaying
this work until the continuation runs. This has a nice side effect of
simplifying digest mismatch handling and making it exception free.

Fixes #977.

Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
2016-03-03 16:48:28 +02:00
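
The shape of the race and the fix, in a simplified standard-C++ sketch
(the real code uses seastar promises; types are stand-ins):

    #include <future>
    #include <vector>

    struct reply {};
    struct result {};

    result merge(const std::vector<reply>&) { return {}; }

    struct digest_resolver {
        std::vector<reply> _replies;
        std::promise<result> _cl_promise;  // was, in effect, promise<void>

        void on_cl_reached() {
            // Build the result *now*, while _replies is still intact;
            // a timeout may clear _replies before any continuation
            // scheduled on the promise gets to run.
            _cl_promise.set_value(merge(_replies));
        }

        void on_timeout() {
            _replies.clear();  // safe: the result was already built
        }
    };
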
Gleb Natapov
e4ac5157bc storage_proxy: store only one data reply in digest resolver.
The read executor may ask for more than one data reply during the
digest resolving stage, but only one result is actually needed to
satisfy a query, so there is no need to store all of them.

Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:53 +02:00
Gleb Natapov
69b61b81ce storage_proxy: fix cl achieved condition in digest resolver timeout handler
In the digest resolver, for CL to be achieved it is not enough to
receive the correct number of replies; a data reply must be among them.
The condition in the digest timeout handler does not check that;
fortunately, we already have a variable that is set to true when CL is
achieved, so use it instead.

Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:11 +02:00
Pekka Enberg
6d7e14a53a Merge "Implement describe_schema_versions" from Paweł
"This series implements describe_schema_versions so that we nodetool
 describecluster can return proper schema information for the whole
 cluster. It involves adding new verb SCHEMA_CHECK which is used to get
 schema version for a given node and a simple map-reduce that using that
 verb gets info from the whole cluster.

 This fixes #677, fixes #684, and fixes #472."
2016-03-02 16:02:53 +02:00
Paweł Dziepak
ca68c36c8c storage_proxy: handle SCHEMA_CHECK verb
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:54 +00:00
Paweł Dziepak
bdc23ae5b5 remove db/serializer.hh includes
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:07:09 +00:00
Gleb Natapov
22d2b9a2dc Yield execution in mutation_result_merger
mutation_result_merger::get can run for a long time. Make it yield
execution from time to time.

Message-Id: <1456674046-14502-1-git-send-email-gleb@scylladb.com>
2016-02-28 17:55:33 +02:00
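
A hedged sketch of periodic yielding, written against the modern
Seastar coroutine API rather than the 2016 continuation style (header
locations approximate; partition_result is a stand-in):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/later.hh>
    #include <vector>

    struct partition_result {};

    seastar::future<> merge_all(std::vector<partition_result> parts) {
        size_t processed = 0;
        for (auto& p : parts) {
            // ... merge p into the running result ...
            (void)p;
            if (++processed % 256 == 0) {
                // Yield to the reactor every few items so the merger
                // cannot monopolize the shard.
                co_await seastar::yield();
            }
        }
    }
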
Gleb Natapov
32e9f1ecd4 Fix read_timeouts storage_proxy counter
Read timeouts are currently not counted. The patch fixes that.

Message-Id: <20160228133315.GN6705@scylladb.com>
2016-02-28 15:34:42 +02:00
Calle Wilund
590ec1674b truncate: Require timestamp join-function to ensure equal values
Fixes #937

In fixing #884 (truncation not truncating memtables properly), time
stamping in truncate was made shard-local. This, however, breaks the
snapshot logic, since for all shards in a truncate the sstables should
snapshot to the same location.

This patch adds a required function argument to truncate (and, by
extension, drop_column_family) that produces a timestamp in a "join"
fashion (i.e. the same on all shards), and uses the joinpoint type in
the caller to do so.

Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
2016-02-24 18:59:31 +02:00
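
The "join" contract, sketched in plain standard C++ (Scylla's actual
utility is a seastar-aware joinpoint type; this only shows the
property being relied on, namely that every shard observes the same
value):

    #include <chrono>
    #include <mutex>

    class timestamp_joinpoint {
        std::once_flag _once;
        std::chrono::system_clock::time_point _ts;
    public:
        // The first caller computes the timestamp; every later caller,
        // on any shard, gets the identical value, so all shards
        // snapshot to the same location.
        std::chrono::system_clock::time_point value() {
            std::call_once(_once, [this] {
                _ts = std::chrono::system_clock::now();
            });
            return _ts;
        }
    };
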
Avi Kivity
1f752446d2 Merge "Truncation format & fixes" from Calle
"Fixes #884
Fixes #895

Also at seastar-dev: calle/truncate_more

1.) Change truncation records to be stored with IDL serialization
2.) Fix db::serializers encoding of replay_position
3.) Detect attempted reading of Origin truncation records, and instead
    of crashing, ignore and warn.
4.) Change truncation time stamps to be generated per-shard, _after_
    CF flush is done, otherwise data in memtables at flush would be
    retained/replayed on next start. Retain the highest time stamp
    generated.

Note for (3): This patch set does _not_ clear out Origin records
automatically, because I feel that is a somewhat drastic and
irreversible thing to do. If we want to give the user a means to get
rid of the (3) warning, we should probably tell him to either use
cqlsh, or add an API call for this, so he can do it explicitly.
"
2016-02-15 11:39:56 +02:00
Tomasz Grabiec
456275e06a storage_proxy: Simplify condition
Message-Id: <1455288472-30538-1-git-send-email-tgrabiec@scylladb.com>
2016-02-14 11:22:15 +02:00
Calle Wilund
18203a4244 database::truncate/drop: Move time stamp generation to shard
Fixes #884

Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).

We generate the TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.

From a batch replay point of view this should be functionally
equivalent, and not a problem.
2016-02-09 15:45:37 +00:00
Gleb Natapov
63a5aa6122 prevent superfluous frozen_mutation copying
Sometimes a frozen_mutation is copied where it could be moved instead.
Fix those cases.

Message-Id: <20160204165708.GI6705@scylladb.com>
2016-02-07 10:54:16 +02:00
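
The kind of fix involved, in miniature (types are stand-ins):

    #include <utility>
    #include <vector>

    struct frozen_mutation { std::vector<char> bytes; };  // stand-in

    void enqueue(frozen_mutation) {}

    void forward(frozen_mutation fm) {
        // Before: enqueue(fm) copied the (potentially large) buffer.
        // After: the last use of fm hands over its storage instead.
        enqueue(std::move(fm));
    }
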
Gleb Natapov
049ae37d08 storage_proxy: change collectd to show foreground mutation instead of overall mutation count
It is much easier to see what is going on this way; otherwise the
graphs for background mutations and overall mutations are very close at
the usual scaling for many workloads.

Message-Id: <20160204083452.GH6705@scylladb.com>
2016-02-04 14:58:56 +02:00
Gleb Natapov
b4b560e0fc change result_digest to hold std::array instead of a std::vector
The digest size is fixed, so there is no need to use std::vector to hold it.

Message-Id: <20160203102530.GU6705@scylladb.com>
2016-02-03 12:27:39 +02:00
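
The gist of the change, assuming an MD5-sized (16-byte) digest:

    #include <array>
    #include <cstdint>

    struct result_digest {
        std::array<uint8_t, 16> bytes{};  // was std::vector<uint8_t>:
                                          // fixed size, no heap allocation
        bool operator==(const result_digest& o) const {
            return bytes == o.bytes;
        }
    };
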
Glauber Costa
f6cfb04d61 add a priority class to mutation readers
SSTables already have a priority argument wired into their read path.
However, most of our reads do not call that interface directly, but employ
the services of a mutation reader instead.

Some of those readers will be used to read through a mutation_source, and
those have to be patched as well.

Right now, whenever we need to pass a class, we pass Seastar's default priority
class.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-01-25 15:20:38 -05:00
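
A sketch of the plumbing shape (all names are stand-ins): the priority
class becomes a parameter of the reader-construction path, defaulted
for callers that do not care:

    struct io_priority_class {};                      // stand-in
    inline const io_priority_class default_priority{};

    struct mutation_reader {};

    struct mutation_source {
        mutation_reader make_reader(
                const io_priority_class& pc = default_priority) {
            // Forward pc down to the sstable read path, so the I/O
            // scheduler can classify the resulting disk reads.
            (void)pc;
            return {};
        }
    };
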
Gleb Natapov
dde2e80a20 storage_proxy: remove batchlog synchronously
Wait for batchlog removal before completing a query; otherwise batchlog
removal queries may accumulate. Still ignore an error if it happens,
since it is not critical, but log it.

Message-Id: <20160118095642.GB6705@scylladb.com>
2016-01-18 12:38:12 +02:00
Avi Kivity
d5050e4c6a storage_proxy: make MUTATION and MUTATION_DONE verbs synchronous at the server side
While MUTATION and MUTATION_DONE are asynchronous by nature (when a MUTATION
completes, it sends a MUTATION_DONE message instead of responding
synchronously), we still want them to be synchronous at the server side
wrt. the RPC server itself.  This is because RPC accounts for resources
consumed by the handler only while the handler is executing; if we return
immediately, and let the code execute asynchronously, RPC believes no
resources are consumed and can instantiate more handlers than the shard
has resources for.

Fix by changing the return type of the handlers to future<no_wait_type>
(from a plain no_wait_type), and making that future complete when local
processing is over.

Ref #596.
Message-Id: <1453048967-5286-1-git-send-email-avi@scylladb.com>
2016-01-18 09:59:34 +02:00
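
A hedged sketch of the handler shape after the fix, using modern
Seastar coroutines and stand-in types (the real no_wait_type lives in
seastar's rpc layer):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <utility>

    struct no_wait_type {};                  // stand-in
    inline constexpr no_wait_type no_wait{};
    struct frozen_mutation {};               // stand-in

    seastar::future<> apply_locally(frozen_mutation) {
        return seastar::make_ready_future<>();
    }
    seastar::future<> send_mutation_done() {
        return seastar::make_ready_future<>();
    }

    // Returning future<no_wait_type> (rather than a plain no_wait_type)
    // keeps the handler "running" from the RPC server's point of view
    // until local processing is over, so resource accounting is correct.
    seastar::future<no_wait_type> handle_mutation(frozen_mutation fm) {
        co_await apply_locally(std::move(fm));
        co_await send_mutation_done();
        co_return no_wait;
    }
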
Gleb Natapov
647a09cd7b storage_proxy: improve mutation timeout logging
Message-Id: <20160114105359.GY6705@scylladb.com>
2016-01-14 12:00:35 +01:00
Avi Kivity
f917f73616 Merge "Handling of schema changes" from Tomasz
"Our domain objects have schema version dependent format, for efficiency
reasons. The data structures which map between columns and values rely on
column ids, which are consecutive integers. For example, we store cells in a
vector where index into the vector is an implicit column id identifying table
column of the cell. When columns are added or removed the column ids may
shift. So, to access mutations or query results one needs to know the version
of the schema corresponding to it.

In case of query results, the schema version to which it conforms will always
be the version which was used to construct the query request. So there's no
change in the way query result consumers operate to handle schema changes. The
interfaces for querying needed to be extended to accept schema version and do
the conversions if necessary.

Shard-local interfaces work with a full definition of schema version,
represented by the schema type (usually passed as schema_ptr). Schema versions
are identified across shards and nodes with a UUID (table_schema_version
type). We maintain schema version registry (schema_registry) to avoid fetching
definitions we already know about. When we get a request using unknown schema,
we need to fetch the definition from the source, which must know it, to obtain
a shard-local schema_ptr for it.

Because mutation representation is schema version dependent, mutations of
different versions don't necessarily commute. When a column is dropped from
schema, the dropped column is no longer representable in the new schema. It is
generally fine to not hold data for dropped columns, the intent behind
dropping a column is to lose the data in that column. However, when merging an
incoming mutation with an existing mutation both of which have different
schema versions, we'd have to choose which schema should be considered
"latest" in order not to loose data. Schema changes can be made concurrently
in the cluster and initiated on different nodes so there is not always a
single notion of latest schema. However, schema changes are commutative and by
merging changes nodes eventually agree on the version.  For example adding
column A (version X) on one node and adding column B (version Y) on another
eventually results in a schema version with both A and B (version Z). We
cannot tell which version among X and Y is newer, but we can tell that version
Z is newer than both X and Y. So the solution to the problem of merging
conflicting mutations could be to ensure that such merge is performed using
the schema which is superior to schemas of both mutations.

The approach taken in the series for ensuring this is as follows. When a node
receives a mutation of an unknown schema version it first performs a schema
merge with the source of that mutation. Schema merge makes sure that current
node's version is superior to the schema of incoming mutation. Once the
version is synced with, it is remembered as such and won't be synced with on
later mutations. Because of this bookkeeping, schema versions must be
monotonic; we don't want table altering to result in any earlier version
because that would cause nodes to avoid syncing with them. The version is a
cryptographically-secure hash of schema mutations, which should fulfill this
purpose in practice.

TODO: It's possible that the node is already performing a sync triggered by
broadcasted schema mutations. To avoid triggering a second sync needlessly, the
schema merging should mark incoming versions as being synced with.

Each table shard keeps track of its current schema version, which is
considered to be superior to all versions which are going to be applied to it.
All data sources for given column family within a shard have the same notion
of current schema version. Individual entries in cache and memtables may be at
earlier versions but this is hidden behind the interface. The entries are
upgraded to current version lazily on access. Sstables are immutable, so they
don't need to track current version. Like any other data source, they can be
queried with any schema version.

Note, the series triggered a bug in demangler:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68700"
2016-01-11 17:59:14 +02:00
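
A much-simplified sketch of the schema_registry idea described above
(types are stand-ins; the real registry is keyed by a UUID and fetches
unknown definitions from the node that sent the version):

    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <unordered_map>

    struct schema {};
    using schema_ptr = std::shared_ptr<const schema>;
    using table_schema_version = uint64_t;  // stands in for the UUID

    class schema_registry {
        std::unordered_map<table_schema_version, schema_ptr> _known;
    public:
        // `fetch` pulls the definition from the source of the request;
        // that node must know it, since it holds a live schema_ptr for
        // the duration of the request.
        schema_ptr get_or_load(table_schema_version v,
                               const std::function<schema_ptr()>& fetch) {
            auto it = _known.find(v);
            if (it != _known.end()) {
                return it->second;  // already learned, no round trip
            }
            auto s = fetch();
            _known.emplace(v, s);
            return s;
        }
    };
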
Vlad Zolotarov
0ed210e117 storage_proxy::query(): intercept exceptions coming from trace()
Exceptions originating from unimplemented to_string() methods may
interrupt the query() flow if not intercepted. Don't let that happen.

Fixes issue #768

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-01-11 12:29:50 +01:00
Tomasz Grabiec
a2cdbff965 storage_proxy: Log failures of definitions update handler
Fixes #769.
2016-01-11 10:34:53 +01:00
Tomasz Grabiec
e1e8858ed1 service: Fetch and sync schema
2016-01-11 10:34:53 +01:00
Tomasz Grabiec
da3a453003 service: Add GET_SCHEMA_VERSION remote call
The verb belongs to a separate client to avoid potential deadlocks,
should throttling on the connection level be introduced in the future.
Another reason is to reduce latency for version requests, as a version
request can potentially block many other requests.
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
4e5a52d6fa db: Make read interface schema version aware
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.

Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.

Because the schema is identified by a UUID across nodes, requestors
must be prepared to be queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.

Schema requesting across nodes is currently stubbed (throws runtime
exception).
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
036974e19b Make mutation interfaces support multiple versions
The schema is tracked in memtable and cache per entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to the
table's current schema on the given shard.

Mutating nodes need to keep a schema_ptr alive in case the schema
version is requested by the target node.
2016-01-11 10:34:51 +01:00
Asias He
2345cda42f messaging_service: Rename shard_id to msg_addr
Using shard_id as the destination for the messaging_service is
confusing, since shard_id is otherwise used in the context of a cpu id.
Message-Id: <8c9ef193dc000ef06f8879e6a01df65cf24635d8.1452155241.git.asias@scylladb.com>
2016-01-07 10:36:35 +02:00
Avi Kivity
f3980f1fad Merge seastar upstream
* seastar 51154f7...8b2171e (9):
  > memcached: avoid a collision of an expiration with time_point(-1).
  > tutorial: minor spelling corrections etc.
  > tutorial: expand semaphores section
  > Merge "Use steady_clock where monotonic clock is required" from Vlad
  > Merge "TLS fixes + RPC adaption" from Calle
  > do_with() optimization
  > tutorial: explain limiting parallelism using semaphores
  > submit_io: change pending flushes criteria
  > apps: remove defunct apps/seastar

Adjust code to use steady_clock instead of high_resolution_clock.
2015-12-27 14:40:20 +02:00
Pekka Enberg
9604d55a44 Merge "Add unit test for get_restricted_ranges()" from Tomek
2015-12-17 09:14:30 +02:00
Avi Kivity
b34a1f6a84 Merge "Preliminary changes for handling of schema changes" from Tomasz
"I extracted some less controversial changes on which the schema changes series will depend
 o somewhat reduce the noise in the main series."
2015-12-16 19:08:22 +02:00
Tomasz Grabiec
872bfadb3d messaging_service: Remove unused parameters from send_migration_request()
2015-12-16 18:06:54 +01:00
Tomasz Grabiec
e445e4785c storage_proxy: Extract get_restricted_ranges() as a free function
To make it directly testable.
2015-12-16 13:09:01 +01:00
Gleb Natapov
de63b3a824 storage_proxy: provide timeout for send_mutation verb
Providing a timeout for the send_mutation verb allows rpc to drop
packets that sit in the outgoing queue for too long.
2015-12-16 10:13:46 +02:00
Gleb Natapov
fe4bc741f4 storage_proxy: throttle mutations based on ongoing background activity
With a consistency level less than ALL, mutation processing can move to
the background (meaning the client was answered, but there is still
work to do on behalf of the request). If the background request
completion rate is lower than the incoming request rate, background
requests will accumulate and eventually exhaust all memory resources.
This patch aims to prevent that situation by monitoring how much memory
all current background requests take and, once some threshold is
passed, no longer moving requests to the background (by not replying to
a client until either memory consumption moves below the threshold or
the request is fully completed).

There are two main points where each background mutation consumes
memory: holding the frozen mutation until the operation is complete (in
order to hint it if it does not complete), and sitting on the rpc queue
to each replica until it is sent out on the wire. The patch accounts
for both of those separately and limits the former to 10% of total
memory and the latter to 6M. Why 6M? The best answer I can give is: why
not :) But on a more serious note, the number should be small enough
that all the data can be sent out in a reasonable amount of time; one
shard is not capable of achieving anything close to full bandwidth, and
empirical evidence shows 6M to be a good number.
2015-12-16 10:13:46 +02:00
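
The accounting idea in miniature (threshold illustrative; the real
patch tracks the two memory sinks separately, as described above):

    #include <cstddef>

    class background_write_throttler {
        size_t _in_flight = 0;  // bytes held by backgrounded mutations
        size_t _threshold;      // e.g. 10% of the shard's memory

    public:
        explicit background_write_throttler(size_t threshold)
            : _threshold(threshold) {}

        void on_moved_to_background(size_t bytes) { _in_flight += bytes; }
        void on_completed(size_t bytes) { _in_flight -= bytes; }

        // While over the threshold the coordinator keeps the client
        // waiting (the write stays foreground) instead of acking early.
        bool may_move_to_background() const {
            return _in_flight < _threshold;
        }
    };
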
Gleb Natapov
e43ae7521f storage_proxy: unfuturize send_to_live_endpoints()
send_to_live_endpoints() is never waited upon; it does its job in the
background. This patch formalizes that by changing the return value to
void, and also refactors the code so that the frozen_mutation shared
pointer is not held longer than it should be: currently it is held
until send_mutation() completes, but since send_mutation() does not use
the frozen_mutation asynchronously, this is not necessary.
2015-12-15 15:40:36 +02:00