scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 14:15:46 +00:00

Author	SHA1	Message	Date
Kamil Braun	013330199d	cdc/storage_proxy: keep cdc_service alive in storage_proxy operations storage_proxy is never deinitialized, so it may have still used cdc_service after its destructor was called. This fixes the problem by cdc_service inheriting from async_sharded_service and storage_proxy calling shared_from_this on the service whenever it uses it. cdc_service inherits from async_sharded_service and not simply from enable_shared_from_this, because there might be other services that cdc_service depends on. Assuming that these services are deinitialized after cdc_service (as they should), i.e. after stop() is called on cdc_service, making cdc_service async_sharded_service will keep their deinitialization code from being called until all references to cdc_service disappear (async_sharded_service keeps stop() from returning until this happens). Some more improvements should be possible through some refactoring: 1. Make augment_mutation_call a free function, not a member of cdc_service: it doesn't need any state that cdc_service has. db_context can be passed down from storage_proxy when it calls the function. 2. Remove the storage_proxy -> cdc_service reference. storage_proxy only needs augment_mutation_call, which would not be a part of the service. This would also get rid of the proxy -> cdc -> proxy reference cycle that we have now, and would allow storage_proxy to be safely deinitialized after cdc_service. 3. Maybe we could even remove the cdc_service -> storage_proxy reference. Is it really needed?	2020-06-08 13:25:51 +03:00
Gleb Natapov	e3ff88e674	lwt: prune system.paxos table when quorum of replicas learned the value Instead of waiting for all replicas to reply execute prune after quorum of replicas. This will keep system.paxos smaller in the case where one node is down. Fixes #6330 Message-Id: <20200525110822.GC233208@scylladb.com>	2020-05-27 08:40:05 +03:00
Piotr Sarna	92aadb94e5	treewide: propagate trace state to write path In order to add tracing to places where it can be useful, e.g. materialized view updates and hinted handoff, tracing state is propagated to all applicable call sites.	2020-05-18 16:05:23 +02:00
Gleb Natapov	d555fb60d7	lwt: add counters for background and foreground paxos operations Paxos may leave an operation in a background after returning result to a caller. Lest add a counter for background/foreground paxos handlers so that it will be easier to detect memory related issues. Message-Id: <20200510092942.GA24506@scylladb.com>	2020-05-11 14:37:00 +02:00
Gleb Natapov	4622c61a37	lwt: linearise reads Currently the following scenario may happen: Consider 3 nodes A, B and C and a LWT failed write operation that managed to get V accepted on A. The value is read twice. First read access B and C and returns nothing. Next one access A and B, notices failed round and completes it. Returns value V. Since two consequent reads without any writes in the middle return different value this breaks linearisability. This happens because read does not do full paxos round. The patch makes read code to reuse the same logic as write by writing a dummy value which ensures that complete paxos round is used.	2020-05-05 15:37:42 +03:00
Gleb Natapov	0c2db6f42d	lwt: linearise unmet condition operations Currently the following scenario may happen: Consider 3 nodes A, B and C and a LWT failed write operation that managed to get V accepted on A. Next operation may be conditioned on a value been V, but it may access nodes B and C first and fail. Retrying the same operation without any writes in the middle may now access A and B and succeed since it will notice V and will complete previous transaction. Having to different outcome for the same operation without any writes in the middle breaks linearisability. This happens because when condition is unmet we abandon the paxos round, so this patch makes us complete it with empty value. Now if first conditional write after failure access B and C it will write accepted ballot there with the value greater than one of V and V will no longer be replayed ever.	2020-05-05 12:38:31 +03:00
Gleb Natapov	0fed86e4c6	lwt: change cas_request::apply signature Change the way query result is passed from getting a reference to a result to getting a foreign_ptr<lw_shared_ptr<query::result>>. This will allow cas_request to keep it without copying.	2020-05-05 12:38:23 +03:00
Gleb Natapov	fbb04698d0	lwt: pass paxos::proposal as a shared pointer everywhere paxos::proposal reference is passed into a lot of functions and sometimes it has to be copied to prolong its lifetime. Create it as a shared pointer and pass it everywhere to avoid those copies.	2020-04-22 13:51:43 +03:00
Konstantin Osipov	18b9bb57ac	lwt: rename metrics to match accepted terminology Rename inherited metrics cas_propose and cas_commit to cas_accept and cas_learn respectively. A while ago we made a decision to stick to widely accepted terms for Paxos rounds: prepare, accept, learn. The rest of the code is using these terms, so rename the metrics to avoid confusion/technical debt. While at it, rename a few internal methods and functions. Fixes #6169 Message-Id: <20200414213537.129547-1-kostja@scylladb.com>	2020-04-15 12:20:30 +02:00
Gleb Natapov	8a408ac5a8	lwt: remove entries from system.paxos table after successful learn stage The learning stage of PAXOS protocol leaves behind an entry in system.paxos table with the last learned value (which can be large). In case not all participants learned it successfully next round on the same key may complete the learning using this info. But if all nodes learned the value the entry does not serve useful purpose any longer. The patch adds another round, "prune", which is executed in background (limited to 1000 simultaneous instances) and removes the entry in case all nodes replied successfully to the "learn" round. It uses the ballot's timestamp to do the deletion, so not to interfere with the next round. Since deletion happens very close to previous writes it will likely happen in memtable and will never reach sstable, so that reduces memtable flush and compaction overhead. Fixes #5779 Message-Id: <20200330154853.GA31074@scylladb.com>	2020-03-30 21:02:14 +03:00
Piotr Dulikowski	ef1c62aa04	storage_proxy: track CDC operations in standard flow Register cdc operation result tracker for write response handlers coming from the usual write requests.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	cccc33f0fd	storage_proxy: add cdc tracker hooks to write response handlers Adds a field to abstract_write_response_handler that points to the cdc operation result tracker, and a function for registering the tracker in the handlers that currently write to a CDC log table.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	e7062de02b	cdc: register metric counters This patch defines a CDC metrics object and registers all of its counters. storage_proxy is chosen as the owner of the metrics object. Because in subsequent commits it will become possible for CDC metrics to be updated after a write operation ends, and because the cdc_service has shorter lifetime than storage_proxy, we could risk a use-after-free if we placed this object inside cdc_service.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	41d82e39ea	storage proxy: rename mutate_hint_from_scratch Changes the name of storage_proxy::mutate_hint_from_scratch function to another name, whose meaning is more clear: send_hint_to_all_replicas. Tests: unit(dev)	2020-02-24 17:30:22 +02:00
Avi Kivity	6c7aa18238	Merge "Introduce schema::get_partitioner" from Piotr " Introduce schema::get_partitioner and use it instead of dht::global_partitioner. Fixes #5493 Tests: unit(dev, release, debug) " * 'per_table_partitioner_prep' of https://github.com/haaawk/scylla: (35 commits) cdc: stop using partitioners partitioner_test: stop calling set_global_partitioner storage_service: stop calling global_partitioner() mutation_writer_test: stop calling global_partitioner() schema: reduce number of global_partitioner() calls test_services: stop calling global_partitioner() sstable_utils: stop calling global_partitioner() sstable_resharding_test: stop depending on global partitioner sstable_mutation_test: stop calling global_partitioner() sstable_data_file_test: stop calling global_partitioner() random_schema: stop taking partitioner in constructor mutation_reader_test: stop calling global_partitioner() multishard_mutation_query_test: stop calling global_partitioner() row_level repair: stop calling global_partitioner() distribute_reader_and_consume_on_shards: don't take partitioner thrift: reduce global_partitioner() calls binary_search: stop calling global_partitioner() index_entry: stop calling global_partitioner() mc writer: stop calling global_partitioner() sstable: stop calling global_partitioner() ...	2020-02-17 18:12:53 +02:00
Piotr Dulikowski	01084a79b8	hh: send orphaned hints on HINT_MUTATION verb When replaying a hint with a destination node that is no longer in the cluster, it will be sent with cl=ALL to all its new replicas. Before this patch, the MUTATION verb was used, which causes such hints to be handled on the same connection and with the same priority as regular writes. This can cause problems when a large number of hints is orphaned and they are scheduled to be sent at once. Such situation may happen when replacing a dead node - all nodes that accumulated hints for the dead node will now send them with cl=ALL to their new replicas. This patch changes the verb used to send such hints to HINT_MUTATION. This verb is handled on a separate connection and with streaming scheduling group, which gives them similar priority to non-orphaned hints. Refs: #4712 Tests: unit(dev)	2020-02-17 14:45:22 +01:00
Piotr Jastrzebski	abd76e566f	dht::shard_of: stop calling global_partitioner() Take const schema& as a parameter of shard_of and use it to obtain partitioner instead of calling global_partitioner(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:23:16 +01:00
Gleb Natapov	7694f164c4	lwt: add more tracing to paxos stages Message-Id: <20200211160653.30317-1-gleb@scylladb.com>	2020-02-16 11:22:30 +02:00
Pavel Emelyanov	fecea1de7e	proxy: Use own token_metadata Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-10 20:54:32 +03:00
Avi Kivity	bed61b96a2	Merge "Move features from storage- into feature-service" from Pavel " There's a lot of code around that needs storage service purely to get the specific feature value (cluster_supports_<something> calls). This creates several circular dependencies, e.g. storage_service <-> migration_manager one and database <-> storage_servuce. Also features sit on storage_service, but register themselfs on the feature_service and the former subscribes on them back which also looks strange. I propose to keep all the features on feature_service, this keeps the latter intependent from other components, makes it possible to break one of the mentioned circle dependencyand heavily relax the other. Also the set helps us fighting the globals and, after it, the feature_service can be safely stopped at the very last moment. Tests: unit(dev), manual debug build start-stop " * 'br-features-to-service-5' of https://github.com/xemul/scylla: gossiper: Avoid string merge-split for nothing features: Stop on shutdown storage_service: Remove helpers storage_service: Prepare to switch from on-board feature helpers cql3: Check feature in .validate database: Use feature service storage_proxy: Use feature service migration_manager: Use feature service start: Pass needed feature as argument into migrate_truncation_records features: Unfriend storage_service features: Simplify feature registration features: Introduce known_feature_set features: Move disabled features set from storage_service features: Move schema_features helper features: Move all features from storage_service to feature_service storage_service: Use feature_config from _feature_service features: Add feature_config storage_service: Kill set_disabled_features gms: Move features stuff into own .cc file migration_manager: Move some fns into class	2020-02-09 19:22:07 +02:00
Nadav Har'El	9fd9ec14c2	storage_proxy: make it into a peering sharded service We consider globals like service::get_storage_proxy() a bad idea, and would like to reduce their use as much as possible - and eventually, eliminate it completely. One easy case to fix case is when we already have a shard-local proxy, but now we need the sharded object, to invoke_on() something on it. In this patch, we turn storage_proxy into a peering_sharded_service. This means that if you already have a storage_proxy, you can call its container() function to get the sharded<storage_proxy>, without needing to call the global service::get_storage_proxy(). We found a few such cases in storage_proxy itself, and in Alternator, and fixed them to use container() instead of the global function. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-02-05 21:14:18 +02:00
Pavel Emelyanov	12c1378be0	storage_proxy: Use feature service Keep reference on local feature service from storage_proxy and use it in places that have (local) storage_proxy at hands. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Eliran Sinvani	971711a546	storage proxy: migrate to per scheduling group statistics This commit builds on top of the introduced per scheduling group statistics template and employs it for achieving a per scheduling group statistics in storage_proxy. Some of the statistics also had meaning as a global - per shard one. Those are the ones for determining if to throttle the write request. This was handled by creating a global stats struct that will hold those stats and by changing the stat update to also include the global one. One point that complicated it is an already existing aggregation over the per shard stats that now became a per scheduling group per shard stats, converting the aggregation to a two-dimensional aggregation. One thing this commit doesn't handle is validating that an individual statistic didn't "cross a scheduling group boundary", such validation is possible but it can easily be added in the future. There is a subtlety to doing so since if the operation did cross to other scheduling group two connected statistics can lose balance for example written bytes and completed write transactions. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:44 +01:00
Gleb Natapov	0d0c05a569	lwt: allow only one paxos instance to run for each key simultaneously This will prevent contention in case of parallel updates of the same row by the same coordinator. The patch does it by introducing a new per key lock map and taking it before running PAXOS protocol (either for write of for read). Message-Id: <20200117101228.GA14816@scylladb.com>	2020-01-28 12:39:23 +02:00
Nadav Har'El	1ed21d70dc	merge: CDC: do mutation augmentation from storage proxy Merged pull request https://github.com/scylladb/scylla/pull/5567 from Calle Wilund: Fixes #5314 Instead of tying CDC handling into cql statement objects, this patch set moves it to storage proxy, i.e. shared code for mutating stuff. This means we automatically handle cdc for code paths outside cql (i.e. alternator). It also adds api handling (though initially inefficient) for batch statements. CDC is tied into storage proxy by giving the former a ref to the latter (per shard). Initially this is not a constructor parameter, because right now we have chicken and egg issues here. Hopefully, Pavels refactoring of migration manager and notifications will untie these and this relationship can become nicer. The actual augmentation can (as stated above) be made much more efficient. Hopefully, the stream management refactoring will deal with expensive stream lookup, and eventually, we can maybe coalesce pre-image selects for batches. However, that is left as an exercise for when deemed needed. The augmentation API has an optional return value for a "post-image handler" to be used iff returned after mutation call is finished (and successful). It is not yet actually invoked from storage_proxy, but it is at least in the call chain.	2020-01-16 17:12:56 +02:00
Gleb Natapov	51672e5990	paxos: immediately sync commitlog entries for writes made by paxos learn stage	2020-01-15 12:15:42 +02:00
Gleb Natapov	d28dd4957b	lwt: Process lwt request on a owning shard LWT is much more efficient if a request is processed on a shard that owns a token for the request. This is because otherwise the processing will bounce to an owning shard multiple times. The patch proposes a way to move request to correct shard before running lwt. It works by returning an error from lwt code if a shard is incorrect one specifying the shard the request should be moved to. The error is processed by transport code that jumps to a correct shard and re-process incoming message there.	2020-01-13 10:26:02 +02:00
Calle Wilund	fc5904372b	storage_proxy: Add (optional) cdc service object pointer member The cdc service is assigned from outside, post construction, mainly because of the chickens and eggs in main startup. Would be nice to have it unconditionally, but this is workable.	2020-01-07 12:01:58 +00:00
Calle Wilund	d6003253dd	storage_proxy: Move mutate_counters to private section It is (and shall) only be called from inside storage proxy, and we would like this to be reflected in the interface so our eventual moving of cdc logic into the mutate call chains become easier to verify and comprehend.	2020-01-07 12:01:58 +00:00
Pekka Enberg	6bc18ba713	storage_proxy: Remove reference to MBean interface The JMX interface is implemented by the scylla-jmx project, not scylla. Therefore, let's remove this historical reference to MBeans from storage_proxy. Message-Id: <20191211121652.22461-1-penberg@scylladb.com>	2019-12-11 14:24:28 +02:00
Benny Halevy	105c8ef5a9	messaging_service: wait on unregister_handler Prepare for returning future<> from seastar rpc unregister_handler. Refs https://github.com/scylladb/scylla/issues/5228 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20191208153924.1953-1-bhalevy@scylladb.com>	2019-12-11 14:17:41 +02:00
Piotr Dulikowski	77d2ceaeba	storage_proxy: handle hints through separate rpc verb	2019-12-05 00:51:52 +01:00
Pavel Solodovnikov	2f442f28af	treewide: add const qualifiers throughout the code base	2019-11-26 02:24:49 +03:00
Vladimir Davydov	bf5f864d80	paxos: piggyback result query on prepare response Current LWT implementation uses at least three network round trips: - first, execute PAXOS prepare phase - second, query the current value of the updated key - third, propose the change to participating replicas (there's also learn phase, but we don't wait for it to complete). The idea behind the optimization implemented by this patch is simple: piggyback the current value of the updated key on the prepare response to eliminate one round trip. To generate less network traffic, only the closest to the coordinator replica sends data while other participating replicas send digests which are used to check data consistency. Note, this patch changes the API of some RPC calls used by PAXOS, but this should be okay as long as the feature in the early development stage and marked experimental. To assess the impact of this optimization on LWT performance, I ran a simple benchmark that starts a number of concurrent clients each of which updates its own key (uncontended case) stored in a cluster of three AWS i3.2xlarge nodes located in the same region (us-west-1) and measures the aggregate bandwidth and latency. The test uses shard-aware gocql driver. Here are the results: latency 99% (ms) bandwidth (rq/s) timeouts (rq/s) clients before after before after before after 1 2 2 626 637 0 0 5 4 3 2616 2843 0 0 10 3 3 4493 4767 0 0 50 7 7 10567 10833 0 0 100 15 15 12265 12934 0 0 200 48 30 13593 14317 0 0 400 185 60 14796 15549 0 0 600 290 94 14416 15669 0 0 800 568 118 14077 15820 2 0 1000 710 118 13088 15830 9 0 2000 1388 232 13342 15658 85 0 3000 1110 363 13282 15422 233 0 4000 1735 454 13387 15385 329 0 That is, this optimization improves max LWT bandwidth by about 15% and allows to run 3-4x more clients while maintaining the same level of system responsiveness.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	ef2e96c47c	storage_proxy: factor out helper to sort endpoints by proximity We need it for PAXOS.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	967a9e3967	storage_proxy: zap ballot_and_contention Pass contention by reference to begin_and_repair_paxos(), where it is incremented on every sleep. Rationale: we want to account the total number of times query() / cas() had to sleep, either directly or within begin_and_repair_paxos(), no matter if the function failed or succeeded.	2019-10-29 19:22:18 +03:00
Konstantin Osipov	0674fab05c	lwt: implement storage_proxy::cas() Introduce service::cas_request abstract base class which can be used to parameterize Paxos logic. Implement storage_proxy::cas() - compare and swap - the storage proxy entry point for lightweight transactions.	2019-10-27 23:42:03 +03:00
Gleb Natapov	70adf65341	storage_proxy: make mutation holder responsible for mutation operation Currently the code that manipulates mutations during write need to check what kind of mutations are those and (sometimes) choose different code paths. This patch encapsulates the differences in virtual functions of mutation_holder object, so that high level code will not concern itself with the details. The functions that are added: apply_locally(), apply_remotely() and store_hint().	2019-10-27 23:21:51 +03:00
Gleb Natapov	b3e01a45d7	lwt: storage_proxy: implement paxos protocol This patch adds all functionality needed for Paxos protocol. The implementation does not strictly adhere to Paxos paper since the original paper allows setting a value only once, while for LWT we need to be able to make another Paxos round after "learn" phase completes, which requires things like repair to be introduced.	2019-10-27 23:21:51 +03:00
Avi Kivity	162730862d	storage_proxy: remove variadic future from query_partition_key_range_concurrent() Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:33:44 +03:00
Avi Kivity	c6b66d197b	Merge "Couple of preparatory patches for lwt" from Gleb " This is a collection of assorted patches that will be needed for LWT. Most of them are trivial, but one touches a lot of files, so have a good chance to cause rebase headache (I already had to rebase it on top of Alternator). Lets push them earlier instead of carrying them in the lwt branch. " * 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev: lwt: make _last_timestamp_micros static lwt: Add client_state::get_timestamp_for_paxos() function lwt: Pass client_state reference all the way to storage_proxy::query exceptions: Add a constructor for unavailable_exception that allows providing a custom message serializer: Add std::variant support lwt: Add missing functions to utils/UUID_gen.hh	2019-09-29 13:02:26 +03:00
Avi Kivity	ba64ec78cf	messaging_service: use rpc::tuple instead of variadic futures for rpc Since variadic future<> is deprecated, switch to rpc::tuple for multiple return values in rpc calls. This is more or less mechanical translation.	2019-09-26 12:09:31 +02:00
Gleb Natapov	e72a105b5e	lwt: Pass client_state reference all the way to storage_proxy::query client_state holds a state to generate monotonically increasing unique timestamp. Queries with a SERIAL consistency level need it to generate a paxos round.	2019-09-26 11:44:00 +03:00
Benny Halevy	1fea5f5904	storage_proxy: refactor remove_response_handler Refactor remove_response_handler_entry out of remove_response_handler, to be called on a valid iterator found by _response_handlers.find(id). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-25 11:19:50 +03:00
Avi Kivity	301246f6c0	storage_proxy: protect _view_update_handlers_list iterators from invalidation on_down() iterates over _view_update_handlers_list, but it yields during iteration, and while it yields, elements in that list can be removed, resulting in a use-after-free. Prevent this by registering iterators that can be potentially invalidated, and any time we remove an element from the list, check whether we're removing an element that is being pointed to by a live iterator. If that is the case, advance the iterator so that it points at a valid element (or at the end of the list). Fixes #4912. Tests: unit (dev)	2019-09-04 17:19:28 +03:00
Gleb Natapov	7d7b1685aa	storage_proxy: store a permit in a read executor A read executor exists until read operation completes in its entirety so storing a permit there guaranties that it will be freed only after no background work left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	d5ced800f0	storage_proxy: store a permit in a write response handler A write response handler exists until write operation completes in its entirety so storing a permit there guaranties that it will be freed only after no background work left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	6a4207f202	Pass service permit to storage_proxy Current cql transport code acquire a permit before processing a query and release it when the query gets a reply, but some quires leave work behind. If the work is allowed to accumulate without any limit a server may eventually run out of memory. To prevent that the permit system should account for the background work as well. The patch is a first step in this direction. It passes a permit down to storage proxy where it will be later hold by background work.	2019-08-12 10:20:43 +03:00
Konstantin Osipov	56f3bda4c7	metrics: introduce a metric for non-local reads A read which arrived to a non-replica and had to be forwarded to a replica by the coordinator is accounted in an own metric, reads_coordinator_outside_replica_set. Most often such read is produced by a driver which is unaware of token distribution on the ring. If a read was forwarded to another replica due to heat weighted load balancing or query preference set by the user, it's not accounted in the metric. In case of a multi-partition read (a query using IN statement, e.g. x in (1, 2, 3)), if any of the keys is read from a non-local node the read is accounted as a non-local. The rationale behind it is that if the user tries to be careful and send IN queries only to the same vnode, they are rewarded with the counter staying at zero, while if they send multi-partition IN queries without any precautions, they will see the metric go up which gives them a starting point for investigating performance problems. Closes #4338	2019-07-08 19:23:38 +03:00
Avi Kivity	591d2968cc	storage_proxy: limit resources consumed in cross-shard operations Currently, each shard protects itself by not reading from rpc and the native transport if in-flight requests consume too much memory for that shard. However, if all shards then forward their requests to some other shard, then that shard can easily run out of memory since its load can be multiplied by the number of shards that send it requests. To protect against this, use the new Seastar smp_service_group infrastructure. We create three groups: read, write, and write ack (the latter is needed to avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B, and shard B simulateously does the same; neither will be able to send acknowledgements, so if the writes are throttled, they will never be unthrottled until a timeout occurs). Range scans are not addressed by this patch since they are handled by multishard_mutation_query, which has its own complex cross-shard communication scheme, but it be a similar solution. Ref #1105 (missing range scan protection) Tests: unit (dev) Message-Id: <20190512142243.17795-1-avi@scylladb.com>	2019-06-07 10:53:23 +02:00

1 2 3 4 5

239 Commits