Commit Graph

245 Commits

Paweł Dziepak
3e0555809e storage_proxy: catch all exceptions in read executor
abstract_read_executor::reconcile() is supposed to make sure that
_result_promise is eventually set to either a result or an exception.
That may not happen, however, if reconciliation throws any exception,
since only read timeouts are being caught. When that happens the
continuation chain becomes stuck.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-31 16:38:41 +01:00
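
A minimal standard-C++ sketch of the pattern this fix implies (the
names follow the commit text; the body is an illustration, not the
actual Scylla code):

    #include <exception>
    #include <future>

    struct result {};  // stand-in for the real query result type

    struct abstract_read_executor {
        std::promise<result> _result_promise;

        result do_reconcile() { return {}; }  // may throw anything

        void reconcile() {
            try {
                _result_promise.set_value(do_reconcile());
            } catch (...) {
                // Catch *all* exceptions, not only read timeouts, so
                // the promise is always set and the continuation chain
                // never gets stuck.
                _result_promise.set_exception(std::current_exception());
            }
        }
    };
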
Glauber Costa
5fa866223d streaming: add incoming streaming mutations to a different sstable
Keeping the mutations coming from the streaming process as mutations like any
other has a number of advantages - and that's why we do it.

However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients and those arriving from peers in the
streaming process.

As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.

To make that differentiation possible, we'll keep a separate set of memtables
that will contain only streaming mutations. We don't have to do it this way,
but doing so makes life a lot easier. In particular, to write an SSTable, our
API requires (because the filter requires it) that a good estimate of the
number of partitions be provided in advance. The partitions also need to be
sorted.

We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:13:00 -04:00
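
A rough sketch of the separation described above, with stand-in types
(the real code plumbs this through column_family and its flush path):

    #include <vector>

    struct memtable {};   // stand-in
    struct mutation {};   // stand-in

    class column_family {
        std::vector<memtable> _memtables;            // client writes
        std::vector<memtable> _streaming_memtables;  // streamed writes

    public:
        enum class origin { client, streaming };

        void apply(const mutation& m, origin o) {
            auto& set = (o == origin::streaming) ? _streaming_memtables
                                                 : _memtables;
            // Add m to the active memtable of the chosen set. Flushing
            // each set under a different I/O priority class is what
            // lets the scheduler tell client load and streaming apart.
            (void)set; (void)m;
        }
    };
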
Paweł Dziepak
9f3893980a move SCHEMA_CHECK registration to migration_manager
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
2016-03-15 12:24:37 +02:00
Asias He
883d8cb8fd storage_service: Move REPLICATION_FINISHED verb to storage_service
It belongs to storage_service, not storage_proxy.
2016-03-15 16:13:22 +08:00
Gleb Natapov
5076f4878b main: Defer storage proxy RPC verb registration after commitlog replay
Message-Id: <20160315071229.GM6117@scylladb.com>
2016-03-15 09:18:12 +02:00
Pekka Enberg
1429213b4c main: Defer migration manager RPC verb registration after commitlog replay
Defer registering migration manager RPC verbs until after the commitlog
has been replayed, so that our own schema is fully loaded before other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
2016-03-14 18:03:16 +01:00
Paweł Dziepak
82d2a2dccb specify whether query::result, result_digest or both are needed
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
Paweł Dziepak
46079f763b query: add keys and tombstones to result digest
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.

Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:27:13 +00:00
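
An illustrative hasher (not Scylla's actual digest code) showing the
idea: clustering keys are always fed into the digest, whether or not
the client selected them, and tombstones are mixed in too:

    #include <cstdint>
    #include <string>

    struct digest_hasher {             // toy FNV-1a, illustration only
        uint64_t state = 1469598103934665603ULL;
        void feed(const std::string& bytes) {
            for (unsigned char c : bytes) {
                state ^= c;
                state *= 1099511628211ULL;
            }
        }
    };

    struct row {
        std::string clustering_key;
        std::string cells;
        bool is_tombstone = false;
    };

    void digest_row(digest_hasher& h, const row& r) {
        h.feed(r.clustering_key);  // included regardless of selection
        if (r.is_tombstone) {
            h.feed("\x01");        // tombstones expose stale replicas early
        }
        h.feed(r.cells);
    }
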
Paweł Dziepak
77dbe3c12f storage_proxy: fix reconciliation with limits
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge these mutations and send the result to the
client; the difference between the merged result and the mutation sent
by a particular replica is sent back to that replica to repair it.
Unfortunately, that may not suffice to provide the user with correct
results in case of disagreements.

Consider the following scenario:

create table cf(p int, c int, r int, primary key(p, c));

node1:
p=0, c=1, r=1 (timestamp = 1)
p=0, c=2, r=2 (timestamp = 2)

node2:
p=0, c=1, r=tombstone (timestamp = 2)
p=0, c=2, r=1 (timestamp = 1)

query:
select r from cf limit 1;

Let's assume there are no row markers. node1 will send only the
outdated cell (p=0, c=1, r=1) while node2 will send both the tombstone
for c=1 and the outdated cell (p=0, c=2, r=1). A disagreement will be
detected, the replies will be merged and the coordinator will respond
to the client with the result r=1, while the correct answer is r=2.

The solution proposed in this patch is to attempt to detect cases where
the problem may occur and retry the query with a larger limit, which
results in replicas providing more information.

The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of the replies from the
replicas (except short reads). If the (pk, ck) of a replica's last row
is smaller than the (pk, ck) of the reconciled result, the query is
retried.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-11 18:26:33 +00:00
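
The detection rule, sketched with a simplified ordered key (the real
code compares partition and clustering keys with the table's
comparators):

    #include <optional>
    #include <utility>

    using last_row_key = std::pair<int, int>;  // (pk, ck), simplified

    // reconciled: (pk, ck) of the last row of the merged result.
    // replica:    last row a replica sent, or nullopt for a short read
    //             (short reads are exempt from the check).
    bool needs_retry(const last_row_key& reconciled,
                     const std::optional<last_row_key>& replica) {
        if (!replica) {
            return false;
        }
        // A replica that stopped before the reconciled result did may
        // be withholding rows that would change the merge: retry the
        // query with a larger limit.
        return *replica < reconciled;
    }
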
Gleb Natapov
f242c6395c storage_proxy: add counter for retried reads
Message-Id: <20160309130453.GF2253@scylladb.com>
2016-03-09 14:09:42 +01:00
Gleb Natapov
ce6d1a242a storage_proxy: fix background_reads counter
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.

Message-Id: <20160307182255.GI4849@scylladb.com>
2016-03-07 19:41:09 +01:00
Gleb Natapov
2d092bbd32 storage_proxy: send read requests with timeout
No need to wait for replies long after the request has timed out.
Message-Id: <1457351304-28721-2-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:11 +01:00
Gleb Natapov
4122422d19 storage_proxy: always wait for digest read resolver done future
Currently it is waited upon only if a background read repair check is
needed, and this causes an unhandled exception warning to be printed if
it enters the failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if the check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:09 +01:00
Gleb Natapov
626c9d046b fix EACH_QUORUM handling during bootstrapping
Currently, write acknowledgement handling does not take the
bootstrapping node into account for CL=EACH_QUORUM. The patch fixes
that.

Fixes #994

Message-Id: <20160307121620.GR2253@scylladb.com>
2016-03-07 13:56:34 +01:00
Gleb Natapov
f59415b3c6 Take pending endpoints into account while checking for sufficient live nodes
During bootstrap, additional copies of data have to be made to ensure
that the CL is met (see CASSANDRA-833 for details). Our code does that,
but it does not take into account that the bootstrapping node can be
dead, which may cause a request to proceed even though there are not
enough live nodes for it to be completed. In such a case the request
neither completes nor times out, so it appears to be stuck from the CQL
layer's point of view. The patch fixes this by taking pending nodes
into account while checking that there are sufficient live nodes for
the operation to proceed.

Fixes #965

Message-Id: <20160303165250.GG2253@scylladb.com>
2016-03-07 13:30:13 +01:00
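
A sketch of the adjusted check (mirroring the Cassandra logic the
commit refers to; names and types are illustrative):

    #include <algorithm>
    #include <stdexcept>
    #include <vector>

    struct endpoint { bool alive; };

    void assure_sufficient_live_nodes(
            size_t block_for,
            const std::vector<endpoint>& natural,
            const std::vector<endpoint>& pending) {
        auto live = [](const endpoint& e) { return e.alive; };
        size_t live_count =
            std::count_if(natural.begin(), natural.end(), live) +
            std::count_if(pending.begin(), pending.end(), live);
        // Each pending endpoint must receive an extra copy, so it
        // raises the number of live nodes the operation needs; a dead
        // pending node must fail the request here instead of leaving
        // it stuck.
        if (live_count < block_for + pending.size()) {
            throw std::runtime_error("cannot achieve consistency level");
        }
    }
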
Gleb Natapov
b89b6f442b storage_proxy: fix race between read cl completion and timeout in digest resolver
If the timeout fires after the CL promise is fulfilled, but before the
continuation runs, it removes all the data that the CL continuation
needs to calculate the result. Fix this by calculating the result
immediately and returning it in the CL promise, instead of delaying
this work until the continuation runs. This has a nice side effect of
simplifying digest mismatch handling and making it exception free.

Fixes #977.

Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
2016-03-03 16:48:28 +02:00
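
The shape of the race and the fix, in a simplified standard-C++ sketch
(the real code uses seastar promises; types are stand-ins):

    #include <future>
    #include <vector>

    struct reply {};
    struct result {};

    result merge(const std::vector<reply>&) { return {}; }

    struct digest_resolver {
        std::vector<reply> _replies;
        std::promise<result> _cl_promise;  // was, in effect, promise<void>

        void on_cl_reached() {
            // Build the result *now*, while _replies is still intact;
            // a timeout may clear _replies before any continuation
            // scheduled on the promise gets to run.
            _cl_promise.set_value(merge(_replies));
        }

        void on_timeout() {
            _replies.clear();  // safe: the result was already built
        }
    };
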
Gleb Natapov
e4ac5157bc storage_proxy: store only one data reply in digest resolver.
The read executor may ask for more than one data reply during the
digest resolving stage, but only one result is actually needed to
satisfy a query, so there is no need to store all of them.

Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:53 +02:00
Gleb Natapov
69b61b81ce storage_proxy: fix cl achieved condition in digest resolver timeout handler
In the digest resolver, for CL to be achieved it is not enough to
receive the correct number of replies; a data reply must be among them.
The condition in the digest timeout handler does not check that;
fortunately, we already have a variable that is set to true when CL is
achieved, so use it instead.

Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
2016-03-03 16:47:11 +02:00
Pekka Enberg
6d7e14a53a Merge "Implement describe_schema_versions" from Paweł
"This series implements describe_schema_versions so that we nodetool
 describecluster can return proper schema information for the whole
 cluster. It involves adding new verb SCHEMA_CHECK which is used to get
 schema version for a given node and a simple map-reduce that using that
 verb gets info from the whole cluster.

 This fixes #677, fixes #684, and fixes #472."
2016-03-02 16:02:53 +02:00
Paweł Dziepak
ca68c36c8c storage_proxy: handle SCHEMA_CHECK verb
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 12:49:54 +00:00
Paweł Dziepak
bdc23ae5b5 remove db/serializer.hh includes
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-03-02 09:07:09 +00:00
Gleb Natapov
22d2b9a2dc Yield execution in mutation_result_merger
mutation_result_merger::get can run for a long time. Make it yield
execution from time to time.

Message-Id: <1456674046-14502-1-git-send-email-gleb@scylladb.com>
2016-02-28 17:55:33 +02:00
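
A hedged sketch of periodic yielding, written against the modern
Seastar coroutine API rather than the 2016 continuation style (header
locations approximate; partition_result is a stand-in):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/later.hh>
    #include <vector>

    struct partition_result {};

    seastar::future<> merge_all(std::vector<partition_result> parts) {
        size_t processed = 0;
        for (auto& p : parts) {
            // ... merge p into the running result ...
            (void)p;
            if (++processed % 256 == 0) {
                // Yield to the reactor every few items so the merger
                // cannot monopolize the shard.
                co_await seastar::yield();
            }
        }
    }
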
Gleb Natapov
32e9f1ecd4 Fix read_timeouts storage_proxy counter
Read timeouts are currently not counted. The patch fixes that.

Message-Id: <20160228133315.GN6705@scylladb.com>
2016-02-28 15:34:42 +02:00
Calle Wilund
590ec1674b truncate: Require timestamp join-function to ensure equal values
Fixes #937

In fixing #884 (truncation not truncating memtables properly), time
stamping in truncate was made shard-local. This, however, breaks the
snapshot logic, since for all shards in a truncate the sstables should
snapshot to the same location.

This patch adds a required function argument to truncate (and, by
extension, drop_column_family) that produces a timestamp in a "join"
fashion (i.e. the same on all shards), and uses the joinpoint type in
the caller to do so.

Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
2016-02-24 18:59:31 +02:00
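
The "join" contract, sketched in plain standard C++ (Scylla's actual
utility is a seastar-aware joinpoint type; this only shows the
property being relied on, namely that every shard observes the same
value):

    #include <chrono>
    #include <mutex>

    class timestamp_joinpoint {
        std::once_flag _once;
        std::chrono::system_clock::time_point _ts;
    public:
        // The first caller computes the timestamp; every later caller,
        // on any shard, gets the identical value, so all shards
        // snapshot to the same location.
        std::chrono::system_clock::time_point value() {
            std::call_once(_once, [this] {
                _ts = std::chrono::system_clock::now();
            });
            return _ts;
        }
    };
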
Avi Kivity
1f752446d2 Merge "Truncation format & fixes" from Calle
"Fixes #884
Fixes #895

Also at seastar-dev: calle/truncate_more

1.) Change truncation records to be stored with IDL serialization
2.) Fix db::serializers encoding of replay_position
3.) Detect attempted reading of Origin truncation records, and instead
    of crashing, ignore and warn.
4.) Change truncation time stamps to be generated per-shard, _after_
    CF flush is done, otherwise data in memtables at flush would be
    retained/replayed on next start. Retain the highest time stamp
    generated.

Note for (3): This patch set does _not_ clear out Origin records
automatically, because I feel that is a somewhat drastic and
irreversible thing to do. If we want to give the user a means to get
rid of the (3) warning, we should probably tell him to either use
cqlsh, or add an API call for this, so he can do it explicitly.
"
2016-02-15 11:39:56 +02:00
Tomasz Grabiec
456275e06a storage_proxy: Simplify condition
Message-Id: <1455288472-30538-1-git-send-email-tgrabiec@scylladb.com>
2016-02-14 11:22:15 +02:00
Calle Wilund
18203a4244 database::truncate/drop: Move time stamp generation to shard
Fixes #884

Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).

We generate the TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.

From a batch replay point of view this should be functionally
equivalent, and not a problem.
2016-02-09 15:45:37 +00:00
Gleb Natapov
63a5aa6122 prevent superfluous frozen_mutation copying
Sometimes a frozen_mutation is copied where it could be moved instead.
Fix those cases.

Message-Id: <20160204165708.GI6705@scylladb.com>
2016-02-07 10:54:16 +02:00
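
The kind of fix involved, in miniature (types are stand-ins):

    #include <utility>
    #include <vector>

    struct frozen_mutation { std::vector<char> bytes; };  // stand-in

    void enqueue(frozen_mutation) {}

    void forward(frozen_mutation fm) {
        // Before: enqueue(fm) copied the (potentially large) buffer.
        // After: the last use of fm hands over its storage instead.
        enqueue(std::move(fm));
    }
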
Gleb Natapov
049ae37d08 storage_proxy: change collectd to show foreground mutation instead of overall mutation count
It is much easier to see what is going on this way; otherwise the
graphs for background mutations and overall mutations are very close at
the usual scaling for many workloads.

Message-Id: <20160204083452.GH6705@scylladb.com>
2016-02-04 14:58:56 +02:00
Gleb Natapov
b4b560e0fc change result_digest to hold std::array instead of a std::vector
The digest size is fixed, so there is no need to use std::vector to hold it.

Message-Id: <20160203102530.GU6705@scylladb.com>
2016-02-03 12:27:39 +02:00
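
The gist of the change, assuming an MD5-sized (16-byte) digest:

    #include <array>
    #include <cstdint>

    struct result_digest {
        std::array<uint8_t, 16> bytes{};  // was std::vector<uint8_t>:
                                          // fixed size, no heap allocation
        bool operator==(const result_digest& o) const {
            return bytes == o.bytes;
        }
    };
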
Glauber Costa
f6cfb04d61 add a priority class to mutation readers
SSTables already have a priority argument wired into their read path.
However, most of our reads do not call that interface directly, but employ
the services of a mutation reader instead.

Some of those readers will be used to read through a mutation_source, and
those have to be patched as well.

Right now, whenever we need to pass a class, we pass Seastar's default priority
class.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-01-25 15:20:38 -05:00
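
A sketch of the plumbing shape (all names are stand-ins): the priority
class becomes a parameter of the reader-construction path, defaulted
for callers that do not care:

    struct io_priority_class {};                      // stand-in
    inline const io_priority_class default_priority{};

    struct mutation_reader {};

    struct mutation_source {
        mutation_reader make_reader(
                const io_priority_class& pc = default_priority) {
            // Forward pc down to the sstable read path, so the I/O
            // scheduler can classify the resulting disk reads.
            (void)pc;
            return {};
        }
    };
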
Gleb Natapov
dde2e80a20 storage_proxy: remove batchlog synchronously
Wait for batchlog removal before completing a query; otherwise batchlog
removal queries may accumulate. Still ignore an error if it happens,
since it is not critical, but log it.

Message-Id: <20160118095642.GB6705@scylladb.com>
2016-01-18 12:38:12 +02:00
Avi Kivity
d5050e4c6a storage_proxy: make MUTATION and MUTATION_DONE verbs synchronous at the server side
While MUTATION and MUTATION_DONE are asynchronous by nature (when a MUTATION
completes, it sends a MUTATION_DONE message instead of responding
synchronously), we still want them to be synchronous at the server side
wrt. the RPC server itself.  This is because RPC accounts for resources
consumed by the handler only while the handler is executing; if we return
immediately, and let the code execute asynchronously, RPC believes no
resources are consumed and can instantiate more handlers than the shard
has resources for.

Fix by changing the return type of the handlers to future<no_wait_type>
(from a plain no_wait_type), and making that future complete when local
processing is over.

Ref #596.
Message-Id: <1453048967-5286-1-git-send-email-avi@scylladb.com>
2016-01-18 09:59:34 +02:00
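
A hedged sketch of the handler shape after the fix, using modern
Seastar coroutines and stand-in types (the real no_wait_type lives in
seastar's rpc layer):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <utility>

    struct no_wait_type {};                  // stand-in
    inline constexpr no_wait_type no_wait{};
    struct frozen_mutation {};               // stand-in

    seastar::future<> apply_locally(frozen_mutation) {
        return seastar::make_ready_future<>();
    }
    seastar::future<> send_mutation_done() {
        return seastar::make_ready_future<>();
    }

    // Returning future<no_wait_type> (rather than a plain no_wait_type)
    // keeps the handler "running" from the RPC server's point of view
    // until local processing is over, so resource accounting is correct.
    seastar::future<no_wait_type> handle_mutation(frozen_mutation fm) {
        co_await apply_locally(std::move(fm));
        co_await send_mutation_done();
        co_return no_wait;
    }
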
Gleb Natapov
647a09cd7b storage_proxy: improve mutation timeout logging
Message-Id: <20160114105359.GY6705@scylladb.com>
2016-01-14 12:00:35 +01:00
Avi Kivity
f917f73616 Merge "Handling of schema changes" from Tomasz
"Our domain objects have schema version dependent format, for efficiency
reasons. The data structures which map between columns and values rely on
column ids, which are consecutive integers. For example, we store cells in a
vector where index into the vector is an implicit column id identifying table
column of the cell. When columns are added or removed the column ids may
shift. So, to access mutations or query results one needs to know the version
of the schema corresponding to it.

In case of query results, the schema version to which it conforms will always
be the version which was used to construct the query request. So there's no
change in the way query result consumers operate to handle schema changes. The
interfaces for querying needed to be extended to accept schema version and do
the conversions if necessary.

Shard-local interfaces work with a full definition of schema version,
represented by the schema type (usually passed as schema_ptr). Schema versions
are identified across shards and nodes with a UUID (table_schema_version
type). We maintain schema version registry (schema_registry) to avoid fetching
definitions we already know about. When we get a request using unknown schema,
we need to fetch the definition from the source, which must know it, to obtain
a shard-local schema_ptr for it.

Because mutation representation is schema version dependent, mutations of
different versions don't necessarily commute. When a column is dropped from
schema, the dropped column is no longer representable in the new schema. It is
generally fine to not hold data for dropped columns, the intent behind
dropping a column is to lose the data in that column. However, when merging an
incoming mutation with an existing mutation both of which have different
schema versions, we'd have to choose which schema should be considered
"latest" in order not to loose data. Schema changes can be made concurrently
in the cluster and initiated on different nodes so there is not always a
single notion of latest schema. However, schema changes are commutative and by
merging changes nodes eventually agree on the version.  For example adding
column A (version X) on one node and adding column B (version Y) on another
eventually results in a schema version with both A and B (version Z). We
cannot tell which version among X and Y is newer, but we can tell that version
Z is newer than both X and Y. So the solution to the problem of merging
conflicting mutations could be to ensure that such merge is performed using
the schema which is superior to schemas of both mutations.

The approach taken in the series for ensuring this is as follows. When a node
receives a mutation of an unknown schema version it first performs a schema
merge with the source of that mutation. Schema merge makes sure that current
node's version is superior to the schema of incoming mutation. Once the
version is synced with, it is remembered as such and won't be synced with on
later mutations. Because of this bookkeeping, schema versions must be
monotonic; we don't want table altering to result in any earlier version
because that would cause nodes to avoid syncing with them. The version is a
cryptographically-secure hash of schema mutations, which should fulfill this
purpose in practice.

TODO: It's possible that the node is already performing a sync triggered by
broadcasted schema mutations. To avoid triggering a second sync needlessly, the
schema merging should mark incoming versions as being synced with.

Each table shard keeps track of its current schema version, which is
considered to be superior to all versions which are going to be applied to it.
All data sources for given column family within a shard have the same notion
of current schema version. Individual entries in cache and memtables may be at
earlier versions but this is hidden behind the interface. The entries are
upgraded to current version lazily on access. Sstables are immutable, so they
don't need to track current version. Like any other data source, they can be
queried with any schema version.

Note, the series triggered a bug in demangler:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68700"
2016-01-11 17:59:14 +02:00
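
A much-simplified sketch of the schema_registry idea described above
(types are stand-ins; the real registry is keyed by a UUID and fetches
unknown definitions from the node that sent the version):

    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <unordered_map>

    struct schema {};
    using schema_ptr = std::shared_ptr<const schema>;
    using table_schema_version = uint64_t;  // stands in for the UUID

    class schema_registry {
        std::unordered_map<table_schema_version, schema_ptr> _known;
    public:
        // `fetch` pulls the definition from the source of the request;
        // that node must know it, since it holds a live schema_ptr for
        // the duration of the request.
        schema_ptr get_or_load(table_schema_version v,
                               const std::function<schema_ptr()>& fetch) {
            auto it = _known.find(v);
            if (it != _known.end()) {
                return it->second;  // already learned, no round trip
            }
            auto s = fetch();
            _known.emplace(v, s);
            return s;
        }
    };
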
Vlad Zolotarov
0ed210e117 storage_proxy::query(): intercept exceptions coming from trace()
Exceptions originating from unimplemented to_string() methods may
interrupt the query() flow if not intercepted. Don't let that happen.

Fixes issue #768

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-01-11 12:29:50 +01:00
Tomasz Grabiec
a2cdbff965 storage_proxy: Log failures of definitions update handler
Fixes #769.
2016-01-11 10:34:53 +01:00
Tomasz Grabiec
e1e8858ed1 service: Fetch and sync schema
2016-01-11 10:34:53 +01:00
Tomasz Grabiec
da3a453003 service: Add GET_SCHEMA_VERSION remote call
The verb belongs to a separate client to avoid potential deadlocks,
should throttling on the connection level be introduced in the future.
Another reason is to reduce latency for version requests, as a version
request can potentially block many other requests.
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
4e5a52d6fa db: Make read interface schema version aware
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.

Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.

Because the schema is identified by a UUID across nodes, requestors
must be prepared to be queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.

Schema requesting across nodes is currently stubbed (throws runtime
exception).
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
036974e19b Make mutation interfaces support multiple versions
The schema is tracked in memtable and cache per entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to the
table's current schema on the given shard.

Mutating nodes need to keep a schema_ptr alive in case the schema
version is requested by the target node.
2016-01-11 10:34:51 +01:00
Asias He
2345cda42f messaging_service: Rename shard_id to msg_addr
Using shard_id as the destination for the messaging_service is
confusing, since shard_id is otherwise used in the context of a cpu id.
Message-Id: <8c9ef193dc000ef06f8879e6a01df65cf24635d8.1452155241.git.asias@scylladb.com>
2016-01-07 10:36:35 +02:00
Avi Kivity
f3980f1fad Merge seastar upstream
* seastar 51154f7...8b2171e (9):
  > memcached: avoid a collision of an expiration with time_point(-1).
  > tutorial: minor spelling corrections etc.
  > tutorial: expand semaphores section
  > Merge "Use steady_clock where monotonic clock is required" from Vlad
  > Merge "TLS fixes + RPC adaption" from Calle
  > do_with() optimization
  > tutorial: explain limiting parallelism using semaphores
  > submit_io: change pending flushes criteria
  > apps: remove defunct apps/seastar

Adjust code to use steady_clock instead of high_resolution_clock.
2015-12-27 14:40:20 +02:00
Pekka Enberg
9604d55a44 Merge "Add unit test for get_restricted_ranges()" from Tomek
2015-12-17 09:14:30 +02:00
Avi Kivity
b34a1f6a84 Merge "Preliminary changes for handling of schema changes" from Tomasz
"I extracted some less controversial changes on which the schema changes series will depend
 o somewhat reduce the noise in the main series."
2015-12-16 19:08:22 +02:00
Tomasz Grabiec
872bfadb3d messaging_service: Remove unused parameters from send_migration_request()
2015-12-16 18:06:54 +01:00
Tomasz Grabiec
e445e4785c storage_proxy: Extract get_restricted_ranges() as a free function
To make it directly testable.
2015-12-16 13:09:01 +01:00
Gleb Natapov
de63b3a824 storage_proxy: provide timeout for send_mutation verb
Providing a timeout for the send_mutation verb allows rpc to drop
packets that sit in the outgoing queue for too long.
2015-12-16 10:13:46 +02:00
Gleb Natapov
fe4bc741f4 storage_proxy: throttle mutations based on ongoing background activity
With a consistency level less than ALL, mutation processing can move to
the background (meaning the client was answered, but there is still
work to do on behalf of the request). If the background request
completion rate is lower than the incoming request rate, background
requests will accumulate and eventually exhaust all memory resources.
This patch aims to prevent that situation by monitoring how much memory
all current background requests take and, once some threshold is
passed, no longer moving requests to the background (by not replying to
a client until either memory consumption moves below the threshold or
the request is fully completed).

There are two main points where each background mutation consumes
memory: holding the frozen mutation until the operation is complete (in
order to hint it if it does not complete), and sitting on the rpc queue
to each replica until it is sent out on the wire. The patch accounts
for both of those separately and limits the former to 10% of total
memory and the latter to 6M. Why 6M? The best answer I can give is: why
not :) But on a more serious note, the number should be small enough
that all the data can be sent out in a reasonable amount of time; one
shard is not capable of achieving anything close to full bandwidth, and
empirical evidence shows 6M to be a good number.
2015-12-16 10:13:46 +02:00
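
The accounting idea in miniature (threshold illustrative; the real
patch tracks the two memory sinks separately, as described above):

    #include <cstddef>

    class background_write_throttler {
        size_t _in_flight = 0;  // bytes held by backgrounded mutations
        size_t _threshold;      // e.g. 10% of the shard's memory

    public:
        explicit background_write_throttler(size_t threshold)
            : _threshold(threshold) {}

        void on_moved_to_background(size_t bytes) { _in_flight += bytes; }
        void on_completed(size_t bytes) { _in_flight -= bytes; }

        // While over the threshold the coordinator keeps the client
        // waiting (the write stays foreground) instead of acking early.
        bool may_move_to_background() const {
            return _in_flight < _threshold;
        }
    };
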
Gleb Natapov
e43ae7521f storage_proxy: unfuturize send_to_live_endpoints()
send_to_live_endpoints() is never waited upon; it does its job in the
background. This patch formalizes that by changing the return value to
void, and also refactors the code so that the frozen_mutation shared
pointer is not held longer than it should be: currently it is held
until send_mutation() completes, but since send_mutation() does not use
the frozen_mutation asynchronously, this is not necessary.
2015-12-15 15:40:36 +02:00