Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached into seastar's directory
tree directly, via #include "seastar/include/seastar/...", to
just refer to <seastar/...>.
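Both fixes have the same shape (the header path is chosen for illustration):

    // Before: quoted include, or reaching into seastar's source tree
    #include "seastar/include/seastar/core/future.hh"

    // After: seastar is an external library, so use angle brackets
    #include <seastar/core/future.hh>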
Closes #10433
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
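For example, a dual-licensed source file now starts with a single line such as:

    /*
     * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
     */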
Closes #9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
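A typical call-site adjustment looks like this (the function is illustrative):

    // Before: the replica-oriented classes lived in the global namespace
    void flush_all(::database& db);

    // After
    void flush_all(replica::database& db);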
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Instead of calling get_local_storage_proxy in paxos_state, get it from the
caller (who is, in fact, storage_proxy or one of its components).
Some of the callers, although they are storage_proxy components, don't
have a storage_proxy reference handy and so they ignominiously call
get_local_storage_proxy() themselves. This will be adjusted later.
The other callers who are, in fact, storage_proxy, have to take special
care not to cross a shard boundary. When they do, smp::submit_to()
is converted to sharded::invoke_on() in order to get the correct local instance.
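A hedged sketch of that conversion (do_something() is a stand-in):

    // Before: switch shards by hand, then fetch the local instance
    return smp::submit_to(shard, [] {
        return service::get_local_storage_proxy().do_something();
    });

    // After: sharded<>::invoke_on() hands the correct local instance
    // to the function itself
    return _proxy.container().invoke_on(shard, [] (storage_proxy& local) {
        return local.do_something();
    });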
Test: unit (dev)
Closes #9824
Some .cc files across the tree include the storage service header
for no real need. Drop the header and include (in some files)
what's really needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.
This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:
- there is a quadratic algorithm in
abstract_write_response_handler::response(): it searches for a replica
and erases it. Since this happens for every replica, it happens N^2/2
times.
- replica sets for writes always include all datacenters, while reads
usually involve just one datacenter.
So, a write to a keyspace that has 5 datacenters (with 3 replicas per
datacenter, i.e. 15 replicas in all) will invoke 15*(15-1)/2 = 105 compares.
We could remove this by sending the index of the replica in the replica
set to the replica and asking it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operations,
which they surely will be. Handling a response after a cross-datacenter
round trip surely involves L3 cache misses, and a small_vector reduces
these to a minimum compared to an unordered_set with its bucket table,
linked list walking and management, and table rehashing.
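The replica-set type boils down to a vector with inline capacity for three
addresses, roughly:

    // At most one allocation, and none at all when the replica set fits
    // the three inline slots (the common RF=3 case).
    using inet_address_vector_replica_set =
        utils::small_vector<gms::inet_address, 3>;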
Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
--task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.
before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op)
after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op)
Closes #8847
A follow-up to the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.
The patch switches the UUID_gen API from using plain integers to
hold time units to units from std::chrono.
For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows us to
increase code reuse with template metaprogramming and to remove a few
UUID_gen functions that became redundant as a result (see the sketch
after the list below).
* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(),
min_time_UUID(), max_time_UUID(), create_time_safe() to
std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse two similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
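The flavor of the API change, sketched (exact signatures may differ):

    // Before: a bare integer whose unit is only documented
    static UUID get_time_UUID(int64_t when_in_millis);

    // After: the unit is part of the type and checked by the compiler
    static UUID get_time_UUID(std::chrono::milliseconds when);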
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. a schema becomes obsolete when it is changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).
It is populated automatically when a new schema version is
pulled from a remote node in get_schema_definition() (migration_manager.cc)
and also when a schema change is propagated to system schema tables
in do_merge_schema() (schema_tables.cc).
The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).
In case we fail to persist the column mapping after a schema change,
missing entries will be recreated on node boot.
Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.
Such a situation may arise under the following circumstances:
1. The previous LWT operation crashed on the "accept" stage,
leaving behind a stale accepted proposal, which waits to be
repaired.
2. The table affected by LWT operation is being altered, so that
schema version is now different. Stored proposal now references
obsolete schema.
3. LWT query is retried, so that Scylla tries to repair the
unfinished Paxos round and apply the mutation in the learn stage.
When such a mismatch happened, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky and the
column_mapping in the mutation happened to be "compatible" with the
new table schema.
It wouldn't work if, for example, the columns were reordered, or
some columns referenced by an LWT query were dropped.
With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using the obtained column mapping, and apply the upgraded mutation
instead. In case we don't find a column_mapping, we just return an
error from the learn stage.
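A hedged sketch of the resulting learn-stage flow; get_stored_column_mapping(),
upgrade_mutation() and apply() are illustrative names, not the actual helpers:

    future<> learn(schema_ptr current, const frozen_mutation& fm) {
        if (fm.schema_version() == current->version()) {
            co_return co_await apply(current, fm);
        }
        // Stale version: consult system.scylla_table_schema_history.
        std::optional<column_mapping> cm = co_await get_stored_column_mapping(
                fm.column_family_id(), fm.schema_version());
        if (!cm) {
            // No mapping persisted: fail the learn stage.
            throw std::runtime_error("unknown schema version");
        }
        // Rewrite the stored mutation against the current schema, then apply.
        co_await apply(current, upgrade_mutation(fm, *cm, current));
    }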
Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Try to look up and use the schema from the local schema_registry
when we have a schema mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation` for
the accepted proposal.
When such a situation happens, the stored `frozen_mutation` can
be applied only if we are lucky and the column_mapping in
the mutation is "compatible" with the new table schema.
It wouldn't work if, for example, the columns are reordered, or
some columns referenced by an LWT query are dropped.
With the patch we are able to mitigate these cases as long as the
referenced schema is still present in the node cache (e.g. the node
didn't restart or crash, and the cache entry is not old enough to
have been evicted).
Tests: unit(dev, debug), dtest(paxos_tests.schema_mismatch_*_test)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200827150844.624017-1-pa.solodovnikov@scylladb.com>
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).
Introduce a new per-table option named `paxos_grace_seconds` to set
the number of seconds used to TTL data in paxos tables
when LWT queries are run against the base table, e.g.
`ALTER TABLE ks.t WITH paxos_grace_seconds = 86400;`.
The default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
This change also makes it easy to test various issues related to paxos TTL.
Fixes #6284
Tests: unit (dev, debug)
Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as on the query class (user or internal), the
previous per-connection method won't do anymore. If the remote doesn't
fill this field, the old per-connection value is used.
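Roughly, on the replica side (member and type names approximate):

    // query::read_command now carries the limit with every request.
    void fixup(query::read_command& cmd,
               query::max_result_size per_connection_default) {
        if (!cmd.max_result_size) {
            // Old coordinator that didn't fill the field: fall back to
            // the value negotiated once per connection.
            cmd.max_result_size = per_connection_default;
        }
    }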
This patch changes the per-table latency histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Besides changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.
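The effect in source form: with -iquote the two include styles search
different paths, so the convention is now enforced by the build (paths
illustrative):

    #include "database.hh"              // scylla's own header, -iquote path
    #include <seastar/core/future.hh>   // external dependency, -I path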
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.
Some callers have a partition_key_view, but not a partition_key, so they
need to create a temporary copy just to pass a reference. Change the
functions to accept a view instead.
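The shape of the change (the function name is illustrative):

    // Before: a caller holding only a view must materialize a key
    void trace_update(const partition_key& pk);

    // After: a view is cheap to copy, and holders of either type can
    // pass it directly
    void trace_update(partition_key_view pk);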
Rename inherited metrics cas_propose and cas_commit
to cas_accept and cas_learn respectively.
A while ago we made a decision to stick to widely accepted
terms for Paxos rounds: prepare, accept, learn. The rest
of the code is using these terms, so rename the metrics
to avoid confusion/technical debt.
While at it, rename a few internal methods and functions.
Fixes #6169
Message-Id: <20200414213537.129547-1-kostja@scylladb.com>
The following sleep injections are added to paxos_state:
* paxos_state_prepare_timeout (timeouts in paxos_state::prepare)
* paxos_state_accept_timeout (timeouts in paxos_state::accept)
* paxos_state_learn_timeout (timeouts in paxos_state::learn)
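Each injection point boils down to a conditional sleep, along these lines
(assuming scylla's error-injection helper; the exact name and overload may
differ):

    // No-op unless "paxos_state_prepare_timeout" has been enabled;
    // otherwise sleep long enough for the request to time out.
    co_await utils::get_local_injector().inject(
            "paxos_state_prepare_timeout", std::chrono::seconds(2));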
Tests: unit ({dev}), unit ({debug})
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200403092107.181057-1-alejo.sanchez@scylladb.com>
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully, the next round on the
same key may complete the learning using this info. But if all nodes
learned the value, the entry no longer serves any useful purpose.
The patch adds another round, "prune", which is executed in the background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round. It uses the
ballot's timestamp to do the deletion, so as not to interfere with the
next round. Since the deletion happens very close to the previous writes,
it will likely happen in the memtable and never reach an sstable, which
reduces memtable flush and compaction overhead.
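A hedged sketch of the background prune (helper names illustrative):

    // Cap in-flight background prunes at 1000.
    seastar::semaphore _prune_sem{1000};

    void prune_in_background(partition_key key, utils::UUID ballot) {
        // Fire and forget; the tombstone uses the ballot's timestamp,
        // so it cannot clobber a later round on the same key.
        (void)seastar::with_semaphore(_prune_sem, 1,
                [key = std::move(key), ballot] () mutable {
            return send_prune_to_replicas(std::move(key), ballot);
        });
    }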
Fixes #5779
Message-Id: <20200330154853.GA31074@scylladb.com>
Introduce dht::get_token and replace all calls to
dht::global_partitioner().get_token with it.
dht::get_token is better because it takes a schema and uses it
to obtain the partitioner instead of using a global partitioner.
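The new helper is roughly:

    namespace dht {
    // Obtain the partitioner from the schema instead of from the
    // process-wide global_partitioner().
    token get_token(const schema& s, partition_key_view key) {
        return s.get_partitioner().get_token(s, key);
    }
    }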
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will prevent contention in case of parallel updates of the same row
by the same coordinator. The patch does it by introducing a new per-key
lock map and taking the lock before running the PAXOS protocol (either
for write or for read).
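A hedged sketch of the idea (container choice, hashing, and cleanup of
idle entries are elided):

    // One semaphore per token: coordinators on this node queue up
    // instead of restarting each other's Paxos rounds.
    std::unordered_map<dht::token, seastar::semaphore> _cas_locks;

    future<bool> cas(dht::token t, cas_request req) {
        auto& sem = _cas_locks.try_emplace(t, 1).first->second;
        return seastar::with_semaphore(sem, 1,
                [this, req = std::move(req)] () mutable {
            return run_paxos_round(std::move(req));  // illustrative
        });
    }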
Message-Id: <20200117101228.GA14816@scylladb.com>
Merged patch series from Gleb Natapov:
"LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move the request to the correct shard before running lwt. It works by
returning an error from the lwt code if the shard is the incorrect one,
specifying the shard the request should be moved to. The error is
processed by the transport code, which jumps to the correct shard and
re-processes the incoming message there.
The nicer way to achieve the same would be to jump to the right shard
inside storage_proxy::cas(), but unfortunately with the current
implementation of the modification statements they are unusable by
a shard different from the one where they were created, so the jump has
to happen before a modification statement for a cas() is created. When
we fix our cql code to be more cross-shard friendly, this can be
reworked to do the jump in the storage_proxy."
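A hedged sketch of the bounce (type and function names illustrative):

    // Thrown by the lwt path when this shard does not own the token.
    struct bounce_to_shard {
        unsigned shard;  // where the request should run
    };

    future<result_ptr> process_request(request req) {
        try {
            co_return co_await execute(req);
        } catch (const bounce_to_shard& e) {
            // Transport layer: re-process the raw incoming message on
            // the owning shard instead of failing the request.
            co_return co_await replay_on_shard(e.shard, std::move(req));
        }
    }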
Gleb Natapov (4):
transport: change make_result to take a reference to cql result
instead of shared_ptr
storage_service: move start_native_transport into a thread
lwt: Process lwt request on an owning shard
lwt: drop invoke_on in paxos_state prepare and accept
auth/service.hh | 5 +-
message/messaging_service.hh | 2 +-
service/client_state.hh | 30 +++-
service/paxos/paxos_state.hh | 10 +-
service/query_state.hh | 6 +
service/storage_proxy.hh | 2 +
transport/messages/result_message.hh | 20 +++
transport/messages/result_message_base.hh | 4 +
transport/request.hh | 4 +
transport/server.hh | 25 ++-
cql3/statements/batch_statement.cc | 6 +
cql3/statements/modification_statement.cc | 6 +
cql3/statements/select_statement.cc | 8 +
message/messaging_service.cc | 2 +-
service/paxos/paxos_state.cc | 48 ++---
service/storage_proxy.cc | 47 ++++-
service/storage_service.cc | 120 +++++++------
test/boost/cql_query_test.cc | 1 +
thrift/handler.cc | 3 +
transport/messages/result_message.cc | 5 +
transport/server.cc | 203 ++++++++++++++++------
21 files changed, 377 insertions(+), 180 deletions(-)
Since lwt requests now run on an owning shard, there is no longer
a need to make a cross-shard call at the paxos_state level. RPC calls may
still arrive at a wrong shard, so we need to make the cross-shard call there.
The current LWT implementation uses at least three network round trips:
- first, execute the PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also a learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the replica closest to the
coordinator sends data, while the other participating replicas send
digests, which are used to check data consistency.
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in the early development
stage and marked experimental.
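A hedged sketch of the extended prepare response (field names approximate):

    struct prepare_response {
        // ... pre-existing Paxos fields: promised ballot, previously
        // accepted proposal, most recent commit ...

        // Piggybacked read of the updated key: exactly one replica
        // (the one closest to the coordinator) fills data; the others
        // fill digest so consistency can still be checked.
        std::optional<foreign_ptr<lw_shared_ptr<query::result>>> data;
        std::optional<query::result_digest> digest;
    };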
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses the
shard-aware gocql driver. Here are the results:
            latency 99% (ms)   bandwidth (rq/s)   timeouts (rq/s)
 clients    before    after    before    after    before    after
       1         2        2       626      637         0        0
       5         4        3      2616     2843         0        0
      10         3        3      4493     4767         0        0
      50         7        7     10567    10833         0        0
     100        15       15     12265    12934         0        0
     200        48       30     13593    14317         0        0
     400       185       60     14796    15549         0        0
     600       290       94     14416    15669         0        0
     800       568      118     14077    15820         2        0
    1000       710      118     13088    15830         9        0
    2000      1388      232     13342    15658        85        0
    3000      1110      363     13282    15422       233        0
    4000      1735      454     13387    15385       329        0
That is, this optimization improves max LWT bandwidth by about 15%
and allows running 3-4x more clients while maintaining the same level
of system responsiveness.
invoke_on() guarantees that captured objects won't be destroyed until the
future returned by the invoked function is resolved, so there's no need
to move the key, token, and proposal when calling the paxos_state::*_impl
helpers.
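In other words, a sketch (prepare_impl stands in for the real helper
signature):

    // The lambda and everything it captured stay alive until the future
    // returned by it resolves, so the _impl helper can take plain
    // references into the captures.
    return container().invoke_on(shard, [key, token, proposal] {
        return paxos_state::prepare_impl(key, token, proposal);
    });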
This patch adds the following per table stats:
cas_prepare_latency
cas_propose_latency
cas_commit_latency
They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
This patch adds all the functionality needed for the Paxos protocol. The
implementation does not strictly adhere to the Paxos paper, since the
original paper allows setting a value only once, while for LWT we need to
be able to run another Paxos round after the "learn" phase completes,
which requires things like repair to be introduced.
The Paxos protocol relies on replicas having state that persists across
crashes/restarts. This patch defines such state and stores it in the
database itself, in the paxos table, to make it persistent.
The stored state is:
in_progress_ballot - promised ballot
proposal - accepted value
proposal_ballot - the ballot of the accepted value
most_recent_commit - most recently learned value
most_recent_commit_at - the ballot of the most recently learned value
This patch adds two data structures that will be used by Paxos. The first
one is "proposal", which contains a ballot and a mutation representing
a value the Paxos protocol is trying to set. The second one is
"prepare_response", which is the value returned by the Paxos prepare stage.
It contains the currently accepted value (if any) and the most recently
learned value (again, if any). The latter is used to "repair" replicas
that missed a previous "learn" message.
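In C++ terms, a hedged sketch of the two structures:

    struct proposal {
        utils::UUID ballot;      // the round this value belongs to
        frozen_mutation update;  // the value Paxos is trying to set
    };

    // Returned by the prepare stage.
    struct prepare_response {
        // Accepted in an earlier, unfinished round, if any.
        std::optional<proposal> accepted_proposal;
        // Most recently learned value, if any; used to "repair"
        // replicas that missed the "learn" message.
        std::optional<proposal> most_recent_commit;
    };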