scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Gleb Natapov	121cd383fa	lwt: remove entries from system.paxos table after successful learn stage The learning stage of PAXOS protocol leaves behind an entry in system.paxos table with the last learned value (which can be large). In case not all participants learned it successfully next round on the same key may complete the learning using this info. But if all nodes learned the value the entry does not serve useful purpose any longer. The patch adds another round, "prune", which is executed in background (limited to 1000 simultaneous instances) and removes the entry in case all nodes replied successfully to the "learn" round. It uses the ballot's timestamp to do the deletion, so not to interfere with the next round. Since deletion happens very close to previous writes it will likely happen in memtable and will never reach sstable, so that reduces memtable flush and compaction overhead. Fixes #5779 Message-Id: <20200330154853.GA31074@scylladb.com> (cherry picked from commit `8a408ac5a8`)	2020-04-02 15:36:52 +02:00
Piotr Dulikowski	ef1c62aa04	storage_proxy: track CDC operations in standard flow Register cdc operation result tracker for write response handlers coming from the usual write requests.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	cccc33f0fd	storage_proxy: add cdc tracker hooks to write response handlers Adds a field to abstract_write_response_handler that points to the cdc operation result tracker, and a function for registering the tracker in the handlers that currently write to a CDC log table.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	e7062de02b	cdc: register metric counters This patch defines a CDC metrics object and registers all of its counters. storage_proxy is chosen as the owner of the metrics object. Because in subsequent commits it will become possible for CDC metrics to be updated after a write operation ends, and because the cdc_service has shorter lifetime than storage_proxy, we could risk a use-after-free if we placed this object inside cdc_service.	2020-03-23 14:05:25 +01:00
Piotr Dulikowski	41d82e39ea	storage proxy: rename mutate_hint_from_scratch Changes the name of storage_proxy::mutate_hint_from_scratch function to another name, whose meaning is more clear: send_hint_to_all_replicas. Tests: unit(dev)	2020-02-24 17:30:22 +02:00
Avi Kivity	6c7aa18238	Merge "Introduce schema::get_partitioner" from Piotr " Introduce schema::get_partitioner and use it instead of dht::global_partitioner. Fixes #5493 Tests: unit(dev, release, debug) " * 'per_table_partitioner_prep' of https://github.com/haaawk/scylla: (35 commits) cdc: stop using partitioners partitioner_test: stop calling set_global_partitioner storage_service: stop calling global_partitioner() mutation_writer_test: stop calling global_partitioner() schema: reduce number of global_partitioner() calls test_services: stop calling global_partitioner() sstable_utils: stop calling global_partitioner() sstable_resharding_test: stop depending on global partitioner sstable_mutation_test: stop calling global_partitioner() sstable_data_file_test: stop calling global_partitioner() random_schema: stop taking partitioner in constructor mutation_reader_test: stop calling global_partitioner() multishard_mutation_query_test: stop calling global_partitioner() row_level repair: stop calling global_partitioner() distribute_reader_and_consume_on_shards: don't take partitioner thrift: reduce global_partitioner() calls binary_search: stop calling global_partitioner() index_entry: stop calling global_partitioner() mc writer: stop calling global_partitioner() sstable: stop calling global_partitioner() ...	2020-02-17 18:12:53 +02:00
Piotr Dulikowski	01084a79b8	hh: send orphaned hints on HINT_MUTATION verb When replaying a hint with a destination node that is no longer in the cluster, it will be sent with cl=ALL to all its new replicas. Before this patch, the MUTATION verb was used, which causes such hints to be handled on the same connection and with the same priority as regular writes. This can cause problems when a large number of hints is orphaned and they are scheduled to be sent at once. Such situation may happen when replacing a dead node - all nodes that accumulated hints for the dead node will now send them with cl=ALL to their new replicas. This patch changes the verb used to send such hints to HINT_MUTATION. This verb is handled on a separate connection and with streaming scheduling group, which gives them similar priority to non-orphaned hints. Refs: #4712 Tests: unit(dev)	2020-02-17 14:45:22 +01:00
Piotr Jastrzebski	abd76e566f	dht::shard_of: stop calling global_partitioner() Take const schema& as a parameter of shard_of and use it to obtain partitioner instead of calling global_partitioner(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:23:16 +01:00
Gleb Natapov	7694f164c4	lwt: add more tracing to paxos stages Message-Id: <20200211160653.30317-1-gleb@scylladb.com>	2020-02-16 11:22:30 +02:00
Pavel Emelyanov	fecea1de7e	proxy: Use own token_metadata Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-10 20:54:32 +03:00
Avi Kivity	bed61b96a2	Merge "Move features from storage- into feature-service" from Pavel " There's a lot of code around that needs storage service purely to get the specific feature value (cluster_supports_<something> calls). This creates several circular dependencies, e.g. the storage_service <-> migration_manager one and the database <-> storage_service one. Also, features sit on storage_service but register themselves on the feature_service, and the former subscribes on them back, which also looks strange. I propose to keep all the features on feature_service; this keeps the latter independent from other components, makes it possible to break one of the mentioned circular dependencies and heavily relax the other. Also the set helps us fight the globals and, after it, the feature_service can be safely stopped at the very last moment. Tests: unit(dev), manual debug build start-stop " * 'br-features-to-service-5' of https://github.com/xemul/scylla: gossiper: Avoid string merge-split for nothing features: Stop on shutdown storage_service: Remove helpers storage_service: Prepare to switch from on-board feature helpers cql3: Check feature in .validate database: Use feature service storage_proxy: Use feature service migration_manager: Use feature service start: Pass needed feature as argument into migrate_truncation_records features: Unfriend storage_service features: Simplify feature registration features: Introduce known_feature_set features: Move disabled features set from storage_service features: Move schema_features helper features: Move all features from storage_service to feature_service storage_service: Use feature_config from _feature_service features: Add feature_config storage_service: Kill set_disabled_features gms: Move features stuff into own .cc file migration_manager: Move some fns into class	2020-02-09 19:22:07 +02:00
Nadav Har'El	9fd9ec14c2	storage_proxy: make it into a peering sharded service We consider globals like service::get_storage_proxy() a bad idea, and would like to reduce their use as much as possible - and eventually, eliminate it completely. One easy case to fix is when we already have a shard-local proxy, but now we need the sharded object, to invoke_on() something on it. In this patch, we turn storage_proxy into a peering_sharded_service. This means that if you already have a storage_proxy, you can call its container() function to get the sharded<storage_proxy>, without needing to call the global service::get_storage_proxy(). We found a few such cases in storage_proxy itself, and in Alternator, and fixed them to use container() instead of the global function. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-02-05 21:14:18 +02:00
Pavel Emelyanov	12c1378be0	storage_proxy: Use feature service Keep reference on local feature service from storage_proxy and use it in places that have (local) storage_proxy at hands. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Eliran Sinvani	971711a546	storage proxy: migrate to per scheduling group statistics This commit builds on top of the introduced per scheduling group statistics template and employs it for achieving a per scheduling group statistics in storage_proxy. Some of the statistics also had meaning as a global - per shard one. Those are the ones for determining if to throttle the write request. This was handled by creating a global stats struct that will hold those stats and by changing the stat update to also include the global one. One point that complicated it is an already existing aggregation over the per shard stats that now became a per scheduling group per shard stats, converting the aggregation to a two-dimensional aggregation. One thing this commit doesn't handle is validating that an individual statistic didn't "cross a scheduling group boundary", such validation is possible but it can easily be added in the future. There is a subtlety to doing so since if the operation did cross to other scheduling group two connected statistics can lose balance for example written bytes and completed write transactions. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:44 +01:00
Gleb Natapov	0d0c05a569	lwt: allow only one paxos instance to run for each key simultaneously This will prevent contention in case of parallel updates of the same row by the same coordinator. The patch does it by introducing a new per key lock map and taking it before running PAXOS protocol (either for write or for read). Message-Id: <20200117101228.GA14816@scylladb.com>	2020-01-28 12:39:23 +02:00
Nadav Har'El	1ed21d70dc	merge: CDC: do mutation augmentation from storage proxy Merged pull request https://github.com/scylladb/scylla/pull/5567 from Calle Wilund: Fixes #5314 Instead of tying CDC handling into cql statement objects, this patch set moves it to storage proxy, i.e. shared code for mutating stuff. This means we automatically handle cdc for code paths outside cql (i.e. alternator). It also adds api handling (though initially inefficient) for batch statements. CDC is tied into storage proxy by giving the former a ref to the latter (per shard). Initially this is not a constructor parameter, because right now we have chicken and egg issues here. Hopefully, Pavels refactoring of migration manager and notifications will untie these and this relationship can become nicer. The actual augmentation can (as stated above) be made much more efficient. Hopefully, the stream management refactoring will deal with expensive stream lookup, and eventually, we can maybe coalesce pre-image selects for batches. However, that is left as an exercise for when deemed needed. The augmentation API has an optional return value for a "post-image handler" to be used iff returned after mutation call is finished (and successful). It is not yet actually invoked from storage_proxy, but it is at least in the call chain.	2020-01-16 17:12:56 +02:00
Gleb Natapov	51672e5990	paxos: immediately sync commitlog entries for writes made by paxos learn stage	2020-01-15 12:15:42 +02:00
Gleb Natapov	d28dd4957b	lwt: Process lwt request on a owning shard LWT is much more efficient if a request is processed on a shard that owns a token for the request. This is because otherwise the processing will bounce to an owning shard multiple times. The patch proposes a way to move the request to the correct shard before running lwt. It works by returning an error from the lwt code if the shard is incorrect, specifying the shard the request should be moved to. The error is processed by the transport code, which jumps to the correct shard and re-processes the incoming message there.	2020-01-13 10:26:02 +02:00
Calle Wilund	fc5904372b	storage_proxy: Add (optional) cdc service object pointer member The cdc service is assigned from outside, post construction, mainly because of the chickens and eggs in main startup. Would be nice to have it unconditionally, but this is workable.	2020-01-07 12:01:58 +00:00
Calle Wilund	d6003253dd	storage_proxy: Move mutate_counters to private section It is (and shall) only be called from inside storage proxy, and we would like this to be reflected in the interface so our eventual moving of cdc logic into the mutate call chains become easier to verify and comprehend.	2020-01-07 12:01:58 +00:00
Pekka Enberg	6bc18ba713	storage_proxy: Remove reference to MBean interface The JMX interface is implemented by the scylla-jmx project, not scylla. Therefore, let's remove this historical reference to MBeans from storage_proxy. Message-Id: <20191211121652.22461-1-penberg@scylladb.com>	2019-12-11 14:24:28 +02:00
Benny Halevy	105c8ef5a9	messaging_service: wait on unregister_handler Prepare for returning future<> from seastar rpc unregister_handler. Refs https://github.com/scylladb/scylla/issues/5228 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20191208153924.1953-1-bhalevy@scylladb.com>	2019-12-11 14:17:41 +02:00
Piotr Dulikowski	77d2ceaeba	storage_proxy: handle hints through separate rpc verb	2019-12-05 00:51:52 +01:00
Pavel Solodovnikov	2f442f28af	treewide: add const qualifiers throughout the code base	2019-11-26 02:24:49 +03:00
Vladimir Davydov	bf5f864d80	paxos: piggyback result query on prepare response Current LWT implementation uses at least three network round trips: - first, execute PAXOS prepare phase - second, query the current value of the updated key - third, propose the change to participating replicas (there's also learn phase, but we don't wait for it to complete). The idea behind the optimization implemented by this patch is simple: piggyback the current value of the updated key on the prepare response to eliminate one round trip. To generate less network traffic, only the replica closest to the coordinator sends data while other participating replicas send digests which are used to check data consistency. Note, this patch changes the API of some RPC calls used by PAXOS, but this should be okay as long as the feature is in the early development stage and marked experimental. To assess the impact of this optimization on LWT performance, I ran a simple benchmark that starts a number of concurrent clients each of which updates its own key (uncontended case) stored in a cluster of three AWS i3.2xlarge nodes located in the same region (us-west-1) and measures the aggregate bandwidth and latency. The test uses shard-aware gocql driver. Here are the results, given as before/after for each number of clients (99% latency in ms, bandwidth in rq/s, timeouts in rq/s): 1 client: 2/2, 626/637, 0/0; 5 clients: 4/3, 2616/2843, 0/0; 10 clients: 3/3, 4493/4767, 0/0; 50 clients: 7/7, 10567/10833, 0/0; 100 clients: 15/15, 12265/12934, 0/0; 200 clients: 48/30, 13593/14317, 0/0; 400 clients: 185/60, 14796/15549, 0/0; 600 clients: 290/94, 14416/15669, 0/0; 800 clients: 568/118, 14077/15820, 2/0; 1000 clients: 710/118, 13088/15830, 9/0; 2000 clients: 1388/232, 13342/15658, 85/0; 3000 clients: 1110/363, 13282/15422, 233/0; 4000 clients: 1735/454, 13387/15385, 329/0. That is, this optimization improves max LWT bandwidth by about 15% and allows running 3-4x more clients while maintaining the same level of system responsiveness.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	ef2e96c47c	storage_proxy: factor out helper to sort endpoints by proximity We need it for PAXOS.	2019-11-24 11:35:29 +02:00
Vladimir Davydov	967a9e3967	storage_proxy: zap ballot_and_contention Pass contention by reference to begin_and_repair_paxos(), where it is incremented on every sleep. Rationale: we want to account the total number of times query() / cas() had to sleep, either directly or within begin_and_repair_paxos(), no matter if the function failed or succeeded.	2019-10-29 19:22:18 +03:00
Konstantin Osipov	0674fab05c	lwt: implement storage_proxy::cas() Introduce service::cas_request abstract base class which can be used to parameterize Paxos logic. Implement storage_proxy::cas() - compare and swap - the storage proxy entry point for lightweight transactions.	2019-10-27 23:42:03 +03:00
Gleb Natapov	70adf65341	storage_proxy: make mutation holder responsible for mutation operation Currently the code that manipulates mutations during write need to check what kind of mutations are those and (sometimes) choose different code paths. This patch encapsulates the differences in virtual functions of mutation_holder object, so that high level code will not concern itself with the details. The functions that are added: apply_locally(), apply_remotely() and store_hint().	2019-10-27 23:21:51 +03:00
Gleb Natapov	b3e01a45d7	lwt: storage_proxy: implement paxos protocol This patch adds all functionality needed for Paxos protocol. The implementation does not strictly adhere to Paxos paper since the original paper allows setting a value only once, while for LWT we need to be able to make another Paxos round after "learn" phase completes, which requires things like repair to be introduced.	2019-10-27 23:21:51 +03:00
Avi Kivity	162730862d	storage_proxy: remove variadic future from query_partition_key_range_concurrent() Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:33:44 +03:00
Avi Kivity	c6b66d197b	Merge "Couple of preparatory patches for lwt" from Gleb " This is a collection of assorted patches that will be needed for LWT. Most of them are trivial, but one touches a lot of files, so have a good chance to cause rebase headache (I already had to rebase it on top of Alternator). Lets push them earlier instead of carrying them in the lwt branch. " * 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev: lwt: make _last_timestamp_micros static lwt: Add client_state::get_timestamp_for_paxos() function lwt: Pass client_state reference all the way to storage_proxy::query exceptions: Add a constructor for unavailable_exception that allows providing a custom message serializer: Add std::variant support lwt: Add missing functions to utils/UUID_gen.hh	2019-09-29 13:02:26 +03:00
Avi Kivity	ba64ec78cf	messaging_service: use rpc::tuple instead of variadic futures for rpc Since variadic future<> is deprecated, switch to rpc::tuple for multiple return values in rpc calls. This is more or less mechanical translation.	2019-09-26 12:09:31 +02:00
Gleb Natapov	e72a105b5e	lwt: Pass client_state reference all the way to storage_proxy::query client_state holds a state to generate monotonically increasing unique timestamp. Queries with a SERIAL consistency level need it to generate a paxos round.	2019-09-26 11:44:00 +03:00
Benny Halevy	1fea5f5904	storage_proxy: refactor remove_response_handler Refactor remove_response_handler_entry out of remove_response_handler, to be called on a valid iterator found by _response_handlers.find(id). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-25 11:19:50 +03:00
Avi Kivity	301246f6c0	storage_proxy: protect _view_update_handlers_list iterators from invalidation on_down() iterates over _view_update_handlers_list, but it yields during iteration, and while it yields, elements in that list can be removed, resulting in a use-after-free. Prevent this by registering iterators that can be potentially invalidated, and any time we remove an element from the list, check whether we're removing an element that is being pointed to by a live iterator. If that is the case, advance the iterator so that it points at a valid element (or at the end of the list). Fixes #4912. Tests: unit (dev)	2019-09-04 17:19:28 +03:00
Gleb Natapov	7d7b1685aa	storage_proxy: store a permit in a read executor A read executor exists until the read operation completes in its entirety, so storing a permit there guarantees that it will be freed only after no background work is left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	d5ced800f0	storage_proxy: store a permit in a write response handler A write response handler exists until the write operation completes in its entirety, so storing a permit there guarantees that it will be freed only after no background work is left for the request on this server.	2019-08-12 10:20:43 +03:00
Gleb Natapov	6a4207f202	Pass service permit to storage_proxy Current cql transport code acquires a permit before processing a query and releases it when the query gets a reply, but some queries leave work behind. If the work is allowed to accumulate without any limit, a server may eventually run out of memory. To prevent that, the permit system should account for the background work as well. The patch is a first step in this direction. It passes a permit down to storage proxy, where it will later be held by background work.	2019-08-12 10:20:43 +03:00
Konstantin Osipov	56f3bda4c7	metrics: introduce a metric for non-local reads A read which arrived at a non-replica and had to be forwarded to a replica by the coordinator is accounted in its own metric, reads_coordinator_outside_replica_set. Most often such a read is produced by a driver which is unaware of token distribution on the ring. If a read was forwarded to another replica due to heat weighted load balancing or query preference set by the user, it's not accounted in the metric. In case of a multi-partition read (a query using IN statement, e.g. x in (1, 2, 3)), if any of the keys is read from a non-local node the read is accounted as non-local. The rationale behind it is that if the user tries to be careful and send IN queries only to the same vnode, they are rewarded with the counter staying at zero, while if they send multi-partition IN queries without any precautions, they will see the metric go up, which gives them a starting point for investigating performance problems. Closes #4338	2019-07-08 19:23:38 +03:00
Avi Kivity	591d2968cc	storage_proxy: limit resources consumed in cross-shard operations Currently, each shard protects itself by not reading from rpc and the native transport if in-flight requests consume too much memory for that shard. However, if all shards then forward their requests to some other shard, then that shard can easily run out of memory since its load can be multiplied by the number of shards that send it requests. To protect against this, use the new Seastar smp_service_group infrastructure. We create three groups: read, write, and write ack (the latter is needed to avoid ABBA deadlocks if shard A exhausts all its resources sending writes to shard B, and shard B simultaneously does the same; neither will be able to send acknowledgements, so if the writes are throttled, they will never be unthrottled until a timeout occurs). Range scans are not addressed by this patch since they are handled by multishard_mutation_query, which has its own complex cross-shard communication scheme, but the solution would be similar. Ref #1105 (missing range scan protection) Tests: unit (dev) Message-Id: <20190512142243.17795-1-avi@scylladb.com>	2019-06-07 10:53:23 +02:00
Piotr Sarna	aea4b7ea78	service: remove unused stop_hints_manager Stopping hints manager now occurs when draining storage proxy and it shouldn't be executed independently, so it's removed from external API.	2019-03-07 13:44:06 +01:00
Piotr Sarna	cc806909d7	storage_proxy: add drain_on_shutdown implementation When storage proxy is shutting down, all interruptible writes can be timed out in order not to wait for them. Instead, the mechanism will fall back to storing hints and/or not progressing with view building.	2019-03-07 13:44:05 +01:00
Piotr Sarna	92df1d5a6b	storage_proxy: add endpoint_lifecycle_subscriber interface Storage proxy is able to react to membership changes in order to cancel long-standing operations for an endpoint.	2019-03-07 12:10:40 +01:00
Piotr Sarna	75ec5fa876	storage_proxy: add intrusive list of view write handlers In order to be able to iterate over view update write response handlers, an intrusive list of them is added to storage proxy. This way iteration can be easily yielded without invalidating iterators and all logic is moved to slow path.	2019-03-07 12:10:40 +01:00
Gleb Natapov	26e5700819	storage_proxy: limit amount of precalculated ranges by query_ranges_to_vnodes_generator Do not precalculate too many ranges in advance; it requires a large allocation and usually means that a consumer of the interface is going to do too much work in parallel. Fixes: #3767	2019-02-12 10:45:25 +02:00
Gleb Natapov	ecc5230de5	storage_proxy: remove old get_restricted_ranges() interface It is not used any more.	2019-02-11 14:45:43 +02:00
Gleb Natapov	2735a85c8e	storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface	2019-02-11 14:45:43 +02:00
Gleb Natapov	692a0bd000	storage_proxy: introduce new query_ranges_to_vnode_generator interface The get_restricted_ranges() function takes query-provided key ranges and divides them on vnode boundaries. It iterates over all ranges and calculates all vnodes, but its users are usually interested in only one vnode, since most likely it will be enough to populate a page. If it is not enough, they will ask for more. This patch introduces a new interface, replacing the function, that generates vnode ranges on demand instead of precalculating all of them.	2019-02-11 14:45:43 +02:00
Piotr Sarna	e0fe9ce2c0	storage_proxy: add allow_hints parameter to send_to_endpoint With hints allowed, send_to_endpoint will leverage consistency level ANY to send data. Otherwise, it will use the default - cl::ONE.	2019-01-28 09:38:41 +01:00
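
The on-demand range splitting described in commit 692a0bd000 can be illustrated with a minimal sketch. This is not Scylla's C++ implementation: the function name `ranges_to_vnodes`, the use of plain integers as tokens, the `(start, end]` interval convention, and the absence of ring wraparound are all simplifying assumptions made here for illustration. It only shows why a generator lets a caller stop after the first vnode range instead of computing all of them up front.

```python
from bisect import bisect_right
from itertools import islice
from typing import Iterator, Sequence, Tuple

# Hypothetical representation: a range is (start, end], with start < end.
Range = Tuple[int, int]


def ranges_to_vnodes(ranges: Sequence[Range],
                     vnode_tokens: Sequence[int]) -> Iterator[Range]:
    """Lazily split each (start, end] range at vnode boundary tokens.

    Nothing is computed until the caller asks for the next piece, so a
    consumer that needs only one vnode range pays for only one.
    """
    tokens = sorted(vnode_tokens)
    for start, end in ranges:
        # Index of the first boundary token strictly greater than start.
        i = bisect_right(tokens, start)
        # Emit one sub-range per boundary that falls strictly inside the range.
        while i < len(tokens) and tokens[i] < end:
            yield (start, tokens[i])
            start = tokens[i]
            i += 1
        # Emit the remainder up to the end of the requested range.
        yield (start, end)


# Usage: take only the first two vnode ranges; the third is never computed.
gen = ranges_to_vnodes([(0, 25)], [10, 20, 30])
print(list(islice(gen, 2)))  # [(0, 10), (10, 20)]
```

A caller that fills a page from the first range simply stops iterating. Capping how many ranges are requested at once, as commit 26e5700819 describes, then amounts to bounding the `islice` count.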


230 Commits