Commit Graph

47 Commits

Author SHA1 Message Date
Avi Kivity
582802825a treewide: use system-#include (angle brackets) for seastar
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.

Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.

Closes #10433
2022-04-26 14:46:42 +03:00
Mikołaj Sielużycki
6f1b6da68a compile: Fix headers so that *-headers targets compile cleanly.
Closes #10273
2022-03-25 16:19:26 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
7bdc999bba service: paxos_state: wean off get_local_storage_proxy()
Instead of calling get_local_storage_proxy in paxos_state, get it from the
caller (who is, in fact, storage_proxy or one of its components).

Some of the callers, although they are storage_proxy components, don't
have a storage_proxy reference handy and so they ignomiously call
get_local_storage_proxy() themselves. This will be adjusted later.

The other callers who are, in fact, storage_proxy, have to take special
care not to cross a shard boundary. When they do, smp::submit_to()
is converted to sharded::invoke_on() in order to get the correct local instance.

Test: unit (dev)

Closes #9824
2021-12-20 00:31:13 +02:00
Pavel Emelyanov
5cfeac0c90 paxos: Drop forward declarations of seastar pointers
They will break compilation after next seastar update, but the
good news is that scylla compiles even without them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211202173643.1070-1-xemul@scylladb.com>
2021-12-02 19:49:03 +02:00
Benny Halevy
ce3fcc121e paxos_state: prepare: handle exception getting data or digest
This exception is ignored by design, but if it's left unhandled,
it generates `Exceptional future ignored` warnings, like the following.

Also, ignore f2 if f1 failed since we return early in this case.

```
[shard 5] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem), backtrace: 0x431689e 0x4316d40 0x43170e8 0x3f35486 0x218d14a 0x3f8002f 0x3f81217 0x3f9f868 0x3f4b76a /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012
N7seastar12continuationINS_8internal22promise_base_with_typeISt7variantIJN5utils4UUIDEN7service5paxos7promiseEEEEEZZZZNS7_11paxos_state7prepareEN7tracing15trace_state_ptrENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandERK13partition_keyS5_bNSI_16digest_algorithmENSt6chrono10time_pointINS_12lowres_clockENSQ_8durationIlSt5ratioILl1ELl1000EEEEEEENK3$_0clEvENUlvE_clEvENKUlSB_E_clESB_EUlT_E_ZNS_6futureISt5tupleIJNS13_IvEENS13_IS14_IJNSE_INSI_6resultEEE17cache_temperatureEEEEEEE14then_impl_nrvoIS12_NS13_IS9_EEEET0_OS11_EUlOSA_RS12_ONS_12future_stateIS1B_EEE_S1B_EE#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, unsigned long, seastar::lowres_clock::duration, std::result_of&&)::{lambda(seastar::basic_semaphore)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::basic_semaphore)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::future<std::variant<utils::UUID, service::paxos::promise> >&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::key_lock_map::with_locked_key<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(dht::token const&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::result_of)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}>(service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}::operator()() const::{lambda(std::variant<utils::UUID, service::paxos::promise>)#1}, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_impl_nrvo<{lambda()#1}, {lambda()#1}<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > > >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#012
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, seastar::future<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >::finally_body<seastar::smp::submit_to<service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}>(unsigned int, se
```

Refs #7779
Refs #9331

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210919053007.13960-1-bhalevy@scylladb.com>
2021-09-19 11:58:21 +03:00
Pavel Emelyanov
c39f04fa6f code: Remove storage-service header from irrelevant places
Some .cc files over the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:19 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and managent, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
2187a59089 treewide: move service::cas_request out from storage_proxy.hh
And remove all remaining inclusions of `storage_proxy.hh` in the
headers.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Konstantin Osipov
c83cf1f965 uuid: switch the API to use std::chrono
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.

The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.

For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.

* switch  get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
  min_time_UUID(), max_time_UUID(), create_time_safe() to
  std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
  redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
2021-04-06 17:12:54 +03:00
Benny Halevy
088f92e574 paxos_state: learn: fix injected error description
It was copy-pasted from another injection point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201220091439.3604201-1-bhalevy@scylladb.com>
2021-01-24 11:51:23 +02:00
Gleb Natapov
85cffd1aeb lwt: rewrite storage_proxy::cas using coroutings
Makes code much simpler to understand.

Message-Id: <20201201160213.GW1655743@scylladb.com>
2020-12-17 18:15:35 +01:00
Pavel Solodovnikov
055fd3d8ad lwt: store column_mapping's for each table schema version upon a DDL change
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such situation may arise under following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such mismatch happens, prior to that patch the stored
`frozen_mutation` is able to be applied only if we are lucky enough
and column_mapping in the mutation is "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

In case we don't find a column_mapping we just return an error
from the learn stage.

Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:24:30 +03:00
Pavel Solodovnikov
88ba184247 paxos: use schema_registry when applying accepted proposal if there is schema mismatch
Try to look up and use schema from the local schema_registry
in case when we have a schema mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation` for
the accepted proposal.

When such situation happens the stored `frozen_mutation` is able
to be applied only if we are lucky enough and column_mapping in
the mutation is "compatible" with the new table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With the patch we are able to mitigate these cases as long as the
referenced schema is still present in the node cache (e.g.
it didn't restart/crash or the cache entry is not too old
to be evicted).

Tests: unit(dev, debug), dtest(paxos_tests.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200827150844.624017-1-pa.solodovnikov@scylladb.com>
2020-08-27 19:04:09 +02:00
Pavel Solodovnikov
9aa4712270 lwt: introduce paxos_grace_seconds per-table option to set paxos ttl
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).

Introduce new per-table option named `paxos_grace_seconds` to set
the amount of seconds which are used to TTL data in paxos tables
when using LWT queries against the base table.

Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.

This change allows to easily test various issues related to paxos TTL.

Fixes #6284

Tests: unit (dev, debug)

Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
2020-08-17 16:44:14 +02:00
Botond Dénes
159d37053d storage_proxy: use read_command::max_result_size to pass max result size around
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
2020-07-28 18:00:29 +03:00
Amnon Heiman
186301aff8 per table metrics: change estimated_histogram to time_estimated_histogram
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.

Beside changing the definition type and the insertion method, the API
was changed to support the new metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-07-14 11:17:43 +03:00
Rafael Ávila de Espíndola
555d8fe520 build: Be consistent about system versus regular headers
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
2020-06-10 15:49:51 +03:00
Alejo Sanchez
59d60ae672 paxos: fix indentation
Fix indentation

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:47:18 +02:00
Alejo Sanchez
019c96cfda paxos: add error injections
Adds error injections on critical points for:

    prepare
    accept
    learn
    release_semaphore_for_key

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:44:53 +02:00
Piotr Sarna
92aadb94e5 treewide: propagate trace state to write path
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.
2020-05-18 16:05:23 +02:00
Avi Kivity
beaeda5234 database: remove variadic future from query() and query_mutations()
Variadic futures are deprecated; replace with future<std::tuple<...>>.

Tests: unit (dev)
2020-05-17 18:45:38 +02:00
Gleb Natapov
97af6bb0bd lwt: make load_paxos_state to take partition_key_view instead of a deference
Some caller have partition_key_view, but not partition_key, so thy need
to create a temporary and copy just to pass a reference. Change it by
accepting a view.
2020-04-22 13:51:43 +03:00
Gleb Natapov
c970da3811 lwt: do not copy proposal in paxos_state::accept
A proposal is passed as a reference and all callers have it in stable
memory until the call ends, so it is safe to use the reference
everywhere.
2020-04-22 13:51:43 +03:00
Konstantin Osipov
18b9bb57ac lwt: rename metrics to match accepted terminology
Rename inherited metrics cas_propose and cas_commit
to cas_accept and cas_learn respectively.

A while ago we made a decision to stick to widely accepted
terms for Paxos rounds: prepare, accept, learn. The rest
of the code is using these terms, so rename the metrics
to avoid confusion/technical debt.

While at it, rename a few internal methods and functions.

Fixes #6169

Message-Id: <20200414213537.129547-1-kostja@scylladb.com>
2020-04-15 12:20:30 +02:00
Pavel Solodovnikov
3206c1bf66 paxos_state: introduce error injections for testing timeouts in paxos stages
The following sleep injections are added to paxos_state:
 * paxos_state_prepare_timeout (timeouts in paxos_state::prepare)
 * paxos_state_accept_timeout (timeouts in paxos_state::accept)
 * paxos_state_learn_timeout (timeouts in paxos_state::learn)

Tests: unit ({dev}), unit ({debug})

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200403092107.181057-1-alejo.sanchez@scylladb.com>
2020-04-08 10:47:15 +02:00
Gleb Natapov
8a408ac5a8 lwt: remove entries from system.paxos table after successful learn stage
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully next round on the same
key may complete the learning using this info. But if all nodes learned
the value the entry does not serve useful purpose any longer.

The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round.  It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
2020-03-30 21:02:14 +03:00
Rafael Ávila de Espíndola
eca0ac5772 everywhere: Update for deprecated apply functions
Now apply is only for tuples, for varargs use invoke.

This depends on the seastar changes adding invoke.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200324163809.93648-1-espindola@scylladb.com>
2020-03-25 08:49:53 +02:00
Piotr Jastrzebski
2d7532f87f dht: add dht::get_token
and replace all calls to dht::global_partitioner().get_token

dht::get_token is better because it takes schema and uses it
to obtain partitioner instead of using a global partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Gleb Natapov
ff88ff880b lwt: use cached truncation record instead of quering the database
Message-Id: <20200206163838.5220-3-gleb@scylladb.com>
2020-02-06 18:15:48 +01:00
Gleb Natapov
0d0c05a569 lwt: allow only one paxos instance to run for each key simultaneously
This will prevent contention in case of parallel updates of the same row
by the same coordinator. The patch does it by introducing a new per key
lock map and taking it before running PAXOS protocol (either for write
of for read).

Message-Id: <20200117101228.GA14816@scylladb.com>
2020-01-28 12:39:23 +02:00
Gleb Natapov
51672e5990 paxos: immediately sync commitlog entries for writes made by paxos learn stage 2020-01-15 12:15:42 +02:00
Nadav Har'El
f16e3b0491 merge: bouncing lwt request to an owning shard
Merged patch series from Gleb Natapov:

"LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt.  It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by the transport
code that jumps to a correct shard and re-process incoming message there.

The nicer way to achieve the same would be to jump to a right shard
inside of the storage_proxy::cas(), but unfortunately with current
implementation of the modification statements they are unusable by
a shard different from where it was created, so the jump should happen
before a modification statement for an cas() is created. When we fix our
cql code to be more cross-shard friendly this can be reworked to do the
jump in the storage_proxy."

Gleb Natapov (4):
  transport: change make_result to takes a reference to cql result
    instead of shared_ptr
  storage_service: move start_native_transport into a thread
  lwt: Process lwt request on a owning shard
  lwt: drop invoke_on in paxos_state prepare and accept

 auth/service.hh                           |   5 +-
 message/messaging_service.hh              |   2 +-
 service/client_state.hh                   |  30 +++-
 service/paxos/paxos_state.hh              |  10 +-
 service/query_state.hh                    |   6 +
 service/storage_proxy.hh                  |   2 +
 transport/messages/result_message.hh      |  20 +++
 transport/messages/result_message_base.hh |   4 +
 transport/request.hh                      |   4 +
 transport/server.hh                       |  25 ++-
 cql3/statements/batch_statement.cc        |   6 +
 cql3/statements/modification_statement.cc |   6 +
 cql3/statements/select_statement.cc       |   8 +
 message/messaging_service.cc              |   2 +-
 service/paxos/paxos_state.cc              |  48 ++---
 service/storage_proxy.cc                  |  47 ++++-
 service/storage_service.cc                | 120 +++++++------
 test/boost/cql_query_test.cc              |   1 +
 thrift/handler.cc                         |   3 +
 transport/messages/result_message.cc      |   5 +
 transport/server.cc                       | 203 ++++++++++++++++------
 21 files changed, 377 insertions(+), 180 deletions(-)
2020-01-14 09:59:59 +02:00
Gleb Natapov
5753ab7195 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call on paxos_state level. RPC calls may
still arrive to a wrong shard so we need to make cross shard call there.
2020-01-13 10:26:02 +02:00
Gleb Natapov
feed544c5d paxos: fix truncation time checking during learn stage
The comparison is done in millisecons, not microseconds.

Fixes #5566

Message-Id: <20200108094927.GN9084@scylladb.com>
2020-01-08 14:37:07 +01:00
Avi Kivity
f7d69b0428 Revert "Merge "bouncing lwt request to an owning shard" from Gleb"
This reverts commit 64cade15cc, reversing
changes made to 9f62a3538c.

This commit is suspected of corrupting the response stream.

Fixes #5479.
2019-12-17 11:06:10 +02:00
Gleb Natapov
64cfb9b1f6 lwt: take raw lock for entire cas duration
It will prevent parallel update by the same coordinator and should
reduce contention.
2019-12-11 14:41:31 +02:00
Gleb Natapov
898d2330a2 lwt: drop invoke_on in paxos_state prepare and accept
Since lwt requests are now running on an owning shard there is no longer
a need to invoke cross shard call.
2019-12-11 14:41:31 +02:00
Vladimir Davydov
bf5f864d80 paxos: piggyback result query on prepare response
Current LWT implementation uses at least three network round trips:
 - first, execute PAXOS prepare phase
 - second, query the current value of the updated key
 - third, propose the change to participating replicas

(there's also learn phase, but we don't wait for it to complete).

The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.

To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.

Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.

To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:

                latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
    clients     before  after       before  after       before  after
          1          2      2          626    637            0      0
          5          4      3         2616   2843            0      0
         10          3      3         4493   4767            0      0
         50          7      7        10567  10833            0      0
        100         15     15        12265  12934            0      0
        200         48     30        13593  14317            0      0
        400        185     60        14796  15549            0      0
        600        290     94        14416  15669            0      0
        800        568    118        14077  15820            2      0
       1000        710    118        13088  15830            9      0
       2000       1388    232        13342  15658           85      0
       3000       1110    363        13282  15422          233      0
       4000       1735    454        13387  15385          329      0

That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
3d1d4b018f paxos: remove unnecessary move constructor invocations
invoke_on() guarantees that captures object won't be destroyed until the
future returned by the invoked function is resolved so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
b75862610e paxos_state: account paxos round latency
This patch adds the following per table stats:

  cas_prepare_latency
  cas_propose_latency
  cas_commit_latency

They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
2019-10-29 19:26:18 +03:00
Gleb Natapov
b3e01a45d7 lwt: storage_proxy: implement paxos protocol
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
2019-10-27 23:21:51 +03:00
Gleb Natapov
d1774693bf lwt: Define state needed by paxos and persist it
Paxos protocol relies on replicas having a state that persists over
crashes/restarts. This patch defines such state and stores it in the
database itself in the paxos table to make it persistent.

The stored state is:
  in_progress_ballot    - promised ballot
  proposal              - accepted value
  proposal_ballot       - the ballot of the accepted value
  most_recent_commit    - most recently learned value
  most_recent_commit_at - the ballot of the most recently learned value
2019-10-27 23:21:51 +03:00
Gleb Natapov
15b935b95d lwt: add data structures needed for paxos implementation
This patch add two data structures that will be used by paxos. First
one is "proposal" which contains a ballot and a mutation representing
a value paxos protocol is trying to set. Second one is
"prepare_response" which is a value returned by paxos prepare stage.
It contains currently accepted value (if any) and most recently
learned value (again if any). The later is used to "repair" replicas
that missed previous "learn" message.
2019-10-27 23:21:51 +03:00