scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Gleb Natapov	75499896ab	client_state: store _user as optional instead of shared_ptr _user cannot outlive client_state class instance, so there is no point in holding it in shared_ptr. Tested: debug test.py and dtest auth_test.py Message-Id: <20191128131217.26294-5-gleb@scylladb.com>	2019-11-28 15:48:59 +02:00
Gleb Natapov	ce5d6d5eee	storage_service: store thrift server as an optional instead of shared_ptr Only do_stop_rpc_server uses the shared_ptr to prolong server's lifetime until stop() completes, but do_with() can be used to achieve the same. Message-Id: <20191128131217.26294-3-gleb@scylladb.com>	2019-11-28 15:48:51 +02:00
Gleb Natapov	b9b99431a8	storage_service: store cql server as an optional instead of shared_ptr Only do_stop_native_transport() uses the shared_ptr to prolong server's lifetime until stop() completes, but do_with() can be used to achieve the same. Message-Id: <20191128131217.26294-2-gleb@scylladb.com>	2019-11-28 15:48:47 +02:00
Pavel Emelyanov	8532093c61	cql: The cql_server does not need proxy reference Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20191127153842.4098-1-xemul@scylladb.com>	2019-11-28 10:58:46 +01:00
Tomasz Grabiec	87b72dad3e	Merge "treewide: add missing const qualifiers" from Pavel Solodovnikov This patchset adds missing "const" function qualifiers throughout the Scylla code base, which would make code less error-prone. The changeset incorporates Kostja's work regarding const qualifiers in the cql code hierarchy along with a follow-up patch addressing the review comment of the corresponding patch set (the patch subject is "cql: propagate const property through prepared statement tree.").	2019-11-27 10:56:20 +01:00
Piotr Sarna	9c5a5a5ac2	treewide: add names to semaphores By default, semaphore exceptions bring along very little context: either that a semaphore was broken or that it timed out. In order to make debugging easier without introducing significant runtime costs, a notion of named semaphore is added. A named semaphore is simply a semaphore with statically defined name, which is present in its errors, bringing valuable context. A semaphore defined as: auto sem = semaphore(0); will present the following message when it breaks: "Semaphore broken" However, a named semaphore: auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"}); will present a message with at least some debugging context: "Semaphore broken: io_concurrency_sem" It's not much, but it would really help in pinpointing bugs without having to inspect core dumps. At the same time, it does not incur any costs for normal semaphore operations (except for its creation), but instead only uses more CPU in case an error is actually thrown, which is considered rare and not to be on the hot path. Refs #4999 Tests: unit(dev), manual: hardcoding a failure in view building code	2019-11-26 15:14:21 +02:00
Pavel Solodovnikov	2f442f28af	treewide: add const qualifiers throughout the code base	2019-11-26 02:24:49 +03:00
Pavel Emelyanov	f6ac969f1e	mm: Stop migration manager Before stopping the db itself, stop the migration service. It must be stopped before RPC, but RPC is not stopped yet itself, so we should be safe here. Here's the tail of the resulting logs: INFO 2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager INFO 2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service INFO 2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished INFO 2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server INFO 2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete. Also -- stop the mm on drain before the commitlog is stopped. [Tomasz: mm needs the cl because pulling schema changes from other nodes involves applying them into the database. So cl/db needs to be stopped after mm is stopped.] The drain logs would look like ... INFO 2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service INFO 2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service INFO 2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED: and then on stop ... INFO 2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished INFO 2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server INFO 2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete. Fixes #5300 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20191125080605.7661-1-xemul@scylladb.com>	2019-11-25 12:59:01 +01:00
Vladimir Davydov	bf5f864d80	paxos: piggyback result query on prepare response Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also learn phase, but we don't wait for it to complete). The idea behind the optimization implemented by this patch is simple: piggyback the current value of the updated key on the prepare response to eliminate one round trip. To generate less network traffic, only the replica closest to the coordinator sends data while other participating replicas send digests which are used to check data consistency. Note, this patch changes the API of some RPC calls used by PAXOS, but this should be okay as long as the feature is in the early development stage and marked experimental. To assess the impact of this optimization on LWT performance, I ran a simple benchmark that starts a number of concurrent clients each of which updates its own key (uncontended case) stored in a cluster of three AWS i3.2xlarge nodes located in the same region (us-west-1) and measures the aggregate bandwidth and latency. The test uses the shard-aware gocql driver. Here are the results:
          latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
clients   before    after     before    after     before    after
      1        2        2        626      637          0        0
      5        4        3       2616     2843          0        0
     10        3        3       4493     4767          0        0
     50        7        7      10567    10833          0        0
    100       15       15      12265    12934          0        0
    200       48       30      13593    14317          0        0
    400      185       60      14796    15549          0        0
    600      290       94      14416    15669          0        0
    800      568      118      14077    15820          2        0
   1000      710      118      13088    15830          9        0
   2000     1388      232      13342    15658         85        0
   3000     1110      363      13282    15422        233        0
   4000     1735      454      13387    15385        329        0
That is, this optimization improves max LWT bandwidth by about 15% and allows running 3-4x more clients while maintaining the same level of system responsiveness.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	3d1d4b018f	paxos: remove unnecessary move constructor invocations invoke_on() guarantees that captured objects won't be destroyed until the future returned by the invoked function is resolved, so there's no need to move key, token, proposal for calling paxos_state::*_impl helpers.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	ef2e96c47c	storage_proxy: factor out helper to sort endpoints by proximity We need it for PAXOS.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	63d4590336	storage_proxy: move digest_algorithm upper We need it for PAXOS. Mark it as static inline while we are at it.	2019-11-24 11:35:29 +02:00
Avi Kivity	1fe062aed4	Merge "Add basic UDF support" from Rafael " This patch series adds only UDF support, UDA will be in the next patch series. With this all CQL types are mapped to Lua. Right now we setup a new lua state and copy the values for each argument and return. This will be optimized once profiled. We require --experimental to enable UDF in case there is some change to the table format. " * 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits) Lua: Document the conversions between Lua and CQL Lua: Implement decimal subtraction Lua: Implement decimal addition Lua: Implement support for returning decimal Lua: Implement decimal to string conversion Lua: Implement decimal to floating point conversion Lua: Implement support for decimal arguments Lua: Implement support for returning varint Lua: Implement support for returning duration Lua: Implement support for duration arguments Lua: Implement support for returning inet Lua: Implement support for inet arguments Lua: Implement support for returning time Lua: Implement support for time arguments Lua: Implement support for returning timeuuid Lua: Implement support for returning uuid Lua: Implement support for uuid and timeuuid arguments Lua: Implement support for returning date Lua: Implement support for date arguments Lua: Implement support for returning timestamp ...	2019-11-17 16:38:19 +02:00
Vladimir Davydov	25aeefd6f3	cql: fix CAS consistency level validation This patch resurrects Cassandra's code validating a consistency level for CAS requests. Basically, it makes CAS requests use a special function instead of validate_for_write to make error messages more coherent. Note, we don't need to resurrect requireNetworkTopologyStrategy as EACH_QUORUM should work just fine for both CAS and non-CAS writes. Looks like it is just an artefact of a rebase in the Cassandra repository.	2019-11-14 12:15:39 +01:00
Gleb Natapov	552c56633e	storage_proxy: do not release mutation if not all replies were received MV backpressure code frees mutation for delayed client replies earlier to save memory. The commit `2d7c026d6e` that introduced the logic claimed to do it only when all replies are received, but this is not the case. Fix the code to free only when all replies are received for real. Fixes #5242 Message-Id: <20191113142117.GA14484@scylladb.com>	2019-11-13 16:23:19 +02:00
Piotr Dulikowski	59fbbb993f	memtables: add partition/row hit/miss counters Adds per-table metrics for counting partition and row reuse in memtables. New metrics are as follows: - memtable_partition_writes - number of write operations performed on partitions in memtables, - memtable_partition_hits - number of write operations performed on partitions that previously existed in a memtable, - memtable_row_writes - number of row write operations performed in memtables, - memtable_row_hits - number of row write operations that overwrote rows previously present in a memtable. Tests: unit(release)	2019-11-12 13:35:41 +01:00
Rafael Ávila de Espíndola	d9337152f3	Use threads when executing user functions This adds a requires_thread predicate to functions and propagates that up until we get to code that already returns futures. We can then use the predicate to decide if we need to use seastar::async. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola	fc72a64c67	Add schema propagation and storage for UDF With this it is possible to create user defined functions and aggregates and they are saved to disk and the schema change is propagated. It is just not possible to call them yet. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola	ce6304d920	UDF: Add a feature and config option to track if udf is enabled It can only be enabled with --experimental. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-11-07 08:40:47 -08:00
Vladimir Davydov	b75862610e	paxos_state: account paxos round latency This patch adds the following per table stats: cas_prepare_latency cas_propose_latency cas_commit_latency They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed by Cassandra.	2019-10-29 19:26:18 +03:00
Vladimir Davydov	c27ab87410	storage_proxy: add cas request accounting This patch implements accounting of Cassandra's metrics related to lightweight transactions, namely: cas_read_latency transactional read latency (histogram) cas_write_latency transactional write latency (histogram) cas_read_timeouts number of transactional read timeouts cas_write_timeouts number of transactional write timeouts cas_read_unavailable number of transactional read unavailable errors cas_write_unavailable number of transactional write unavailable errors cas_read_unfinished_commit number of transaction commit attempts that occurred on read cas_write_unfinished_commit number of transaction commit attempts that occurred on write cas_write_condition_not_met number of transaction preconditions that did not match current values cas_read_contention how many contended reads were encountered (histogram) cas_write_contention how many contended writes were encountered (histogram)	2019-10-29 19:25:47 +03:00
Vladimir Davydov	967a9e3967	storage_proxy: zap ballot_and_contention Pass contention by reference to begin_and_repair_paxos(), where it is incremented on every sleep. Rationale: we want to account the total number of times query() / cas() had to sleep, either directly or within begin_and_repair_paxos(), no matter if the function failed or succeeded.	2019-10-29 19:22:18 +03:00
Gleb Natapov	0e9df4eaf8	lwt: mark lwt as experimental We may want to change paxos tables format and change internode protocol, so hide lwt behind experimental flag for now. Message-Id: <20191029102725.GM2866@scylladb.com>	2019-10-29 14:33:48 +02:00
Gleb Natapov	e5e44bfda2	client_state: fix get_timestamp_for_paxos() to always advance a timestamp Message-Id: <20191029102336.GL2866@scylladb.com>	2019-10-29 13:07:33 +02:00
Vladimir Davydov	e0b31dd273	query: add flag to return static row on partition with no rows A SELECT statement that has clustering key restrictions isn't supposed to return static content if no regular row matches the restrictions, see #589. However, for the CAS statement we do need to return static content on failure so this patch adds a flag that allows the caller to override this behavior.	2019-10-28 21:50:44 +03:00
Konstantin Osipov	203eb3eccc	lwt: sleep a random amount of time when retrying CAS Sleep a random interval between 0 and 100 ms before retrying CAS. Reuse sleep function, make the distribution object thread local.	2019-10-27 23:42:03 +03:00
Konstantin Osipov	0674fab05c	lwt: implement storage_proxy::cas() Introduce service::cas_request abstract base class which can be used to parameterize Paxos logic. Implement storage_proxy::cas() - compare and swap - the storage proxy entry point for lightweight transactions.	2019-10-27 23:42:03 +03:00
Gleb Natapov	70adf65341	storage_proxy: make mutation holder responsible for mutation operation Currently the code that manipulates mutations during write needs to check what kind of mutations those are and (sometimes) choose different code paths. This patch encapsulates the differences in virtual functions of mutation_holder object, so that high level code will not concern itself with the details. The functions that are added: apply_locally(), apply_remotely() and store_hint().	2019-10-27 23:21:51 +03:00
Gleb Natapov	b3e01a45d7	lwt: storage_proxy: implement paxos protocol This patch adds all functionality needed for Paxos protocol. The implementation does not strictly adhere to Paxos paper since the original paper allows setting a value only once, while for LWT we need to be able to make another Paxos round after "learn" phase completes, which requires things like repair to be introduced.	2019-10-27 23:21:51 +03:00
Gleb Natapov	d1774693bf	lwt: Define state needed by paxos and persist it Paxos protocol relies on replicas having a state that persists over crashes/restarts. This patch defines such state and stores it in the database itself in the paxos table to make it persistent. The stored state is: in_progress_ballot - promised ballot proposal - accepted value proposal_ballot - the ballot of the accepted value most_recent_commit - most recently learned value most_recent_commit_at - the ballot of the most recently learned value	2019-10-27 23:21:51 +03:00
Gleb Natapov	15b935b95d	lwt: add data structures needed for paxos implementation This patch adds two data structures that will be used by paxos. First one is "proposal" which contains a ballot and a mutation representing a value paxos protocol is trying to set. Second one is "prepare_response" which is a value returned by paxos prepare stage. It contains currently accepted value (if any) and most recently learned value (again if any). The latter is used to "repair" replicas that missed previous "learn" message.	2019-10-27 23:21:51 +03:00
Kamil Braun	e74b5deb5d	cql3: enable non-frozen UDTs. Add a cluster feature for non-frozen UDTs. If the cluster supports non-frozen UDTs, do not return an error message when trying to create a table with a non-frozen user type.	2019-10-25 12:04:44 +02:00
Asias He	f876580740	storage_service: Reject nodetool cleanup when there is pending ranges From Shlomi: 4 node cluster Node A, B, C, D (Node A: seed) cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node> cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node> while read is progressing Node D: nodetool decommission Node A: nodetool status node - wait for UL Node A: nodetool cleanup (while decommission progresses) I get the error on c-s once decommission ends java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated The problem is that when a node gets new ranges (e.g., the bootstrapping node, or the existing nodes after a node is removed or decommissioned), nodetool cleanup will remove data within the new ranges which the node has just received from other nodes. To fix this, we should reject nodetool cleanup when there are pending ranges on that node. Note, rejecting nodetool cleanup is not a full protection because new ranges can be assigned to the node while cleanup is still in progress. However, it is a good start to reject until we have a full protection solution. Refs: #5045	2019-10-23 19:20:36 +08:00
Asias He	a39c8d0ed0	Revert "storage_service: remove storage_service::_is_bootstrap_mode." It will be needed by "storage_service: Reject nodetool cleanup when there is pending ranges" This reverts commit `dbca327b46`.	2019-10-23 19:20:36 +08:00
Kamil Braun	f1c26bf5c9	storage_service: more comments in join_token_ring Explain why a call to update_normal_tokens is needed.	2019-10-21 11:11:03 +02:00
Kamil Braun	dbca327b46	storage_service: remove storage_service::_is_bootstrap_mode. The flag did nothing. It was used in one place to check if there's a bug, but it can easily be proven by reading the code that the check would never pass.	2019-10-21 11:11:03 +02:00
Kamil Braun	b757a19f84	storage_service: simplify storage_service::bootstrap method The storage_service::bootstrap method took a parameter: tokens to bootstrap with. However, this method is only called in one place (join_token_ring) with only one parameter: _bootstrap_tokens. It doesn't make sense to call this method anywhere else with any other parameter. This commit also adds a comment explaining what the method does and moves it into the private section of storage_service.	2019-10-21 11:11:03 +02:00
Kamil Braun	84b41bd89b	storage_service: fix typo in handle_state_moving	2019-10-21 11:11:03 +02:00
Kamil Braun	2ff4f9b8f4	storage_service: remove unnecessary use of stringstream	2019-10-21 11:11:03 +02:00
Kamil Braun	06cc7d409d	storage_service: remove redundant call to update_tokens during join_token_ring When a non-seed node was bootstrapping, system_keyspace::update_tokens was called twice: first right after the tokens were generated (or received if we were replacing a different node) in the call to `bootstrap`, and then later in join_token_ring. The second call was redundant. The join_token_ring call was also redundant if we were not bootstrapping and had tokens saved previously (e.g. when restarting). In that case we would have read them from LOCAL and then saved the same tokens again. This commit removes the redundant call and inserts calls to update_tokens where they are necessary, when new tokens are generated. The aim is to make the code easier to understand. It also adds a comment which explains why the tokens don't need to be generated in one of the cases.	2019-10-21 11:11:03 +02:00
Kamil Braun	a223864f81	storage_service: remove storage_service::set_tokens method. After commit `36ccf72f3c`, this method was used only in one place. Its name did not make it obvious what it does and when it is safe to call it. This commit pulls out the code from set_tokens to the point where it was called (join_token_ring). The code is only possible to understand in context. This code was also saving the tokens to the LOCAL table before retrieving them from this table again. There is no point in doing that: 1. there are no races, since when join_token_ring is running, it is the only function which can call system_keyspace::update_tokens (which saves them to the LOCAL table). There can be no multiple instances of join_token_ring. 2. Even if there was a race, this wouldn't fix anything. The tokens we retrieve from LOCAL by calling get_local_tokens().get0() could already be different in the LOCAL table when the get0() returns.	2019-10-21 11:09:59 +02:00
Kamil Braun	36ccf72f3c	storage_service: remove is_survey_mode That was dead, untested code, making it unnecessarily hard to implement new features.	2019-10-21 10:38:49 +02:00
Kamil Braun	602c7268cc	storage_service::handle_state_normal: tokens_to_update* -> owned_tokens Replace the two variables: tokens_to_update_in_metadata tokens_to_update_in_system_keyspace which were exactly the same, with one variable owned_tokens. The new name describes what the variable IS instead of what it's used for. Add a comment to clarify what "owned" means: those are the tokens the node chose and any collision was resolved positively for this node. Move the variable definition further down in the code, where it's actually needed.	2019-10-21 10:38:49 +02:00
Kamil Braun	2db07c697f	storage_service::handle_state_normal: remove local_tokens_to_remove That was dead code. Removing tokens is handled inside remove_endpoint, using the endpoints_to_remove set.	2019-10-21 10:38:49 +02:00
Piotr Jastrzebski	afe520ad77	gossip: Add application_state::IGNORE_MSB_BITS We would like to share with other nodes the value of ignore_msb_bits property used by the node. This is needed because CDC will operate on streams of changes. Each shard on each node will have its own stream that will be identified by a stream_id. Stream_id will be selected in such a way that using stream_id as partition key will locate partition identified by stream_id on a node and shard that the stream belongs to. To be able to generate such stream_id we need to know ignore_msb_bits property value for each node. IMPORTANT NOTE: At this point CDC does not support topology changes. It will work only on a stable cluster. Support for topology modifications will be added in later steps. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-10-17 10:55:31 +02:00
Piotr Jastrzebski	a66d7cfe57	gossip: Add application_state::SHARD_COUNT We would like to share with other nodes the number of shards available at the node. This is needed because CDC will operate on streams of changes. Each shard on each node will have its own stream that will be identified by a stream_id. Stream_id will be selected in such a way that using stream_id as partition key will locate partition identified by stream_id on a node and shard that the stream belongs to. To be able to generate such stream_id we need to know how many shards are on each node. IMPORTANT NOTE: At this point CDC does not support topology changes. It will work only on a stable cluster. Support for topology modifications will be added in later steps. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-10-17 10:55:31 +02:00
Piotr Jastrzebski	f7ce8e4f2b	cdc: Add flag guarding its usage At first, CDC will only be enabled when the experimental flag is on. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-10-17 10:55:31 +02:00
Piotr Sarna	36a1905e98	storage_proxy: handle unstarted write cancelling When another node is reported to be down, view updates queued for it are cancelled, but some of them may already be initiated. Right now, cancelling such a write results in an exception, but on a conceptual level it's not really an exception, since this behaviour is expected. Previous version of this patch was based on introducing a special exception type that was later handled specially, but it's not clear if it's a good direction. Instead, this patch simply makes this path non-exceptional, as was originally done by Nadav in the first version of the series that introduced handling unstarted write cancellations. Additionally, a message containing the information that a write is cancelled is logged with debug level.	2019-10-07 16:55:36 +03:00
Avi Kivity	162730862d	storage_proxy: remove variadic future from query_partition_key_range_concurrent() Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:33:44 +03:00
Avi Kivity	968b34a2b4	storage_proxy: remove variadic future from digest_read_resolver Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:32:17 +03:00
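
Commit 203eb3eccc in the list above describes sleeping a random interval between 0 and 100 ms before retrying CAS, with a thread-local distribution object. The following is a minimal sketch of that idea in standard C++ only. The function name `random_cas_retry_delay` is hypothetical, and the sketch does not use Scylla's or Seastar's actual sleep helper, so it should be read as an illustration rather than the code from the commit.

```cpp
#include <chrono>
#include <random>

// Hypothetical helper (not Scylla's actual function): picks a random
// delay in the closed range [0, 100] ms to wait before retrying a
// contended CAS operation.
inline std::chrono::milliseconds random_cas_retry_delay() {
    // One engine and one distribution per thread: they are constructed
    // once per thread instead of on every retry, and need no locking.
    static thread_local std::mt19937 engine{std::random_device{}()};
    static thread_local std::uniform_int_distribution<int> dist{0, 100};
    return std::chrono::milliseconds(dist(engine));
}
```

A caller would pass the returned duration to whatever sleep primitive its runtime provides before starting the next attempt. Randomizing the delay makes it less likely that contending coordinators retry at the same moment and collide again.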


1519 Commits