The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully next round on the same
key may complete the learning using this info. But if all nodes learned
the value the entry does not serve useful purpose any longer.
The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round. It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.
Fixes#5779
Message-Id: <20200330154853.GA31074@scylladb.com>
(cherry picked from commit 8a408ac5a8)
Adds a field to abstract_write_response_handler that points to the cdc
operation result tracker, and a function for registering the tracker in
the handlers that currently write to a CDC log table.
This patch defines a CDC metrics object and registers all of its
counters.
storage_proxy is chosen as the owner of the metrics object. Because in
subsequent commits it will become possible for CDC metrics to be updated
after a write operation ends, and because the cdc_service has shorter
lifetime than storage_proxy, we could risk a use-after-free if we placed
this object inside cdc_service.
Changes the name of storage_proxy::mutate_hint_from_scratch function to
another name, whose meaning is more clear: send_hint_to_all_replicas.
Tests: unit(dev)
When replaying a hint with a destination node that is no longer in the
cluster, it will be sent with cl=ALL to all its new replicas. Before
this patch, the MUTATION verb was used, which causes such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are scheduled to be sent at once. Such situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.
This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with streaming
scheduling group, which gives them similar priority to non-orphaned
hints.
Refs: #4712
Tests: unit(dev)
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
There's a lot of code around that needs storage service purely to
get the specific feature value (cluster_supports_<something> calls).
This creates several circular dependencies, e.g. storage_service <->
migration_manager one and database <-> storage_servuce. Also features
sit on storage_service, but register themselfs on the feature_service
and the former subscribes on them back which also looks strange.
I propose to keep all the features on feature_service, this keeps the
latter intependent from other components, makes it possible to break
one of the mentioned circle dependencyand heavily relax the other.
Also the set helps us fighting the globals and, after it, the
feature_service can be safely stopped at the very last moment.
Tests: unit(dev), manual debug build start-stop
"
* 'br-features-to-service-5' of https://github.com/xemul/scylla:
gossiper: Avoid string merge-split for nothing
features: Stop on shutdown
storage_service: Remove helpers
storage_service: Prepare to switch from on-board feature helpers
cql3: Check feature in .validate
database: Use feature service
storage_proxy: Use feature service
migration_manager: Use feature service
start: Pass needed feature as argument into migrate_truncation_records
features: Unfriend storage_service
features: Simplify feature registration
features: Introduce known_feature_set
features: Move disabled features set from storage_service
features: Move schema_features helper
features: Move all features from storage_service to feature_service
storage_service: Use feature_config from _feature_service
features: Add feature_config
storage_service: Kill set_disabled_features
gms: Move features stuff into own .cc file
migration_manager: Move some fns into class
We consider globals like service::get_storage_proxy() a bad idea,
and would like to reduce their use as much as possible - and eventually,
eliminate it completely.
One easy case to fix case is when we already have a shard-local proxy,
but now we need the sharded object, to invoke_on() something on it.
In this patch, we turn storage_proxy into a peering_sharded_service.
This means that if you already have a storage_proxy, you can call
its container() function to get the sharded<storage_proxy>, without
needing to call the global service::get_storage_proxy().
We found a few such cases in storage_proxy itself, and in Alternator,
and fixed them to use container() instead of the global function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Keep reference on local feature service from storage_proxy
and use it in places that have (local) storage_proxy at hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit builds on top of the introduced per scheduling group
statistics template and employs it for achieving a per scheduling
group statistics in storage_proxy.
Some of the statistics also had meaning as a global - per
shard one. Those are the ones for determining if to
throttle the write request. This was handled by creating a
global stats struct that will hold those stats and by changing
the stat update to also include the global one.
One point that complicated it is an already existing aggregation
over the per shard stats that now became a per scheduling group
per shard stats, converting the aggregation to a two-dimensional
aggregation.
One thing this commit doesn't handle is validating that an individual
statistic didn't "cross a scheduling group boundary", such validation
is possible but it can easily be added in the future. There is a
subtlety to doing so since if the operation did cross to other
scheduling group two connected statistics can lose balance
for example written bytes and completed write transactions.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
This will prevent contention in case of parallel updates of the same row
by the same coordinator. The patch does it by introducing a new per key
lock map and taking it before running PAXOS protocol (either for write
of for read).
Message-Id: <20200117101228.GA14816@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/5567
from Calle Wilund:
Fixes#5314
Instead of tying CDC handling into cql statement objects, this patch set
moves it to storage proxy, i.e. shared code for mutating stuff. This means
we automatically handle cdc for code paths outside cql (i.e. alternator).
It also adds api handling (though initially inefficient) for batch statements.
CDC is tied into storage proxy by giving the former a ref to the latter (per
shard). Initially this is not a constructor parameter, because right now we
have chicken and egg issues here. Hopefully, Pavels refactoring of migration
manager and notifications will untie these and this relationship can become
nicer.
The actual augmentation can (as stated above) be made much more efficient.
Hopefully, the stream management refactoring will deal with expensive stream
lookup, and eventually, we can maybe coalesce pre-image selects for batches.
However, that is left as an exercise for when deemed needed.
The augmentation API has an optional return value for a "post-image handler"
to be used iff returned after mutation call is finished (and successful).
It is not yet actually invoked from storage_proxy, but it is at least in the
call chain.
LWT is much more efficient if a request is processed on a shard that owns
a token for the request. This is because otherwise the processing will
bounce to an owning shard multiple times. The patch proposes a way to
move request to correct shard before running lwt. It works by returning
an error from lwt code if a shard is incorrect one specifying the shard
the request should be moved to. The error is processed by transport code
that jumps to a correct shard and re-process incoming message there.
The cdc service is assigned from outside, post construction, mainly
because of the chickens and eggs in main startup. Would be nice to
have it unconditionally, but this is workable.
It is (and shall) only be called from inside storage proxy,
and we would like this to be reflected in the interface
so our eventual moving of cdc logic into the mutate call
chains become easier to verify and comprehend.
The JMX interface is implemented by the scylla-jmx project, not scylla.
Therefore, let's remove this historical reference to MBeans from
storage_proxy.
Message-Id: <20191211121652.22461-1-penberg@scylladb.com>
Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:
latency 99% (ms) bandwidth (rq/s) timeouts (rq/s)
clients before after before after before after
1 2 2 626 637 0 0
5 4 3 2616 2843 0 0
10 3 3 4493 4767 0 0
50 7 7 10567 10833 0 0
100 15 15 12265 12934 0 0
200 48 30 13593 14317 0 0
400 185 60 14796 15549 0 0
600 290 94 14416 15669 0 0
800 568 118 14077 15820 2 0
1000 710 118 13088 15830 9 0
2000 1388 232 13342 15658 85 0
3000 1110 363 13282 15422 233 0
4000 1735 454 13387 15385 329 0
That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
Pass contention by reference to begin_and_repair_paxos(), where it is
incremented on every sleep. Rationale: we want to account the total
number of times query() / cas() had to sleep, either directly or within
begin_and_repair_paxos(), no matter if the function failed or succeeded.
Introduce service::cas_request abstract base class
which can be used to parameterize Paxos logic.
Implement storage_proxy::cas() - compare and swap - the storage proxy
entry point for lightweight transactions.
Currently the code that manipulates mutations during write need to
check what kind of mutations are those and (sometimes) choose different
code paths. This patch encapsulates the differences in virtual
functions of mutation_holder object, so that high level code will not
concern itself with the details. The functions that are added:
apply_locally(), apply_remotely() and store_hint().
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
"
This is a collection of assorted patches that will be needed for LWT.
Most of them are trivial, but one touches a lot of files, so have a
good chance to cause rebase headache (I already had to rebase it on
top of Alternator). Lets push them earlier instead of carrying them in
the lwt branch.
"
* 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev:
lwt: make _last_timestamp_micros static
lwt: Add client_state::get_timestamp_for_paxos() function
lwt: Pass client_state reference all the way to storage_proxy::query
exceptions: Add a constructor for unavailable_exception that allows providing a custom message
serializer: Add std::variant support
lwt: Add missing functions to utils/UUID_gen.hh
client_state holds a state to generate monotonically increasing unique
timestamp. Queries with a SERIAL consistency level need it to generate
a paxos round.
Refactor remove_response_handler_entry out of remove_response_handler,
to be called on a valid iterator found by _response_handlers.find(id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
on_down() iterates over _view_update_handlers_list, but it yields during iteration,
and while it yields, elements in that list can be removed, resulting in a
use-after-free.
Prevent this by registering iterators that can be potentially invalidated, and
any time we remove an element from the list, check whether we're removing an element
that is being pointed to by a live iterator. If that is the case, advance the iterator
so that it points at a valid element (or at the end of the list).
Fixes#4912.
Tests: unit (dev)
A read executor exists until read operation completes in its entirety
so storing a permit there guaranties that it will be freed only after
no background work left for the request on this server.
A write response handler exists until write operation completes in its
entirety so storing a permit there guaranties that it will be freed only
after no background work left for the request on this server.
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
A read which arrived to a non-replica and had to be forwarded to a
replica by the coordinator is accounted in an own metric,
reads_coordinator_outside_replica_set.
Most often such read is produced by a driver which is unaware of
token distribution on the ring.
If a read was forwarded to another replica due to heat weighted
load balancing or query preference set by the user, it's not accounted
in the metric.
In case of a multi-partition read (a query using IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a
non-local node the read is accounted as a non-local.
The rationale behind it is that if the user tries to be careful and send
IN queries only to the same vnode, they are rewarded with the counter
staying at zero, while if they send multi-partition IN queries without
any precautions, they will see the metric go up which gives them a
starting point for investigating performance problems.
Closes#4338
Currently, each shard protects itself by not reading from rpc and the native
transport if in-flight requests consume too much memory for that shard. However,
if all shards then forward their requests to some other shard, then that shard
can easily run out of memory since its load can be multiplied by the number of
shards that send it requests.
To protect against this, use the new Seastar smp_service_group infrastructure.
We create three groups: read, write, and write ack (the latter is needed to
avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B,
and shard B simulateously does the same; neither will be able to send
acknowledgements, so if the writes are throttled, they will never be unthrottled
until a timeout occurs).
Range scans are not addressed by this patch since they are handled by
multishard_mutation_query, which has its own complex cross-shard communication
scheme, but it be a similar solution.
Ref #1105 (missing range scan protection)
Tests: unit (dev)
Message-Id: <20190512142243.17795-1-avi@scylladb.com>
When storage proxy is shutting down, all interruptible writes
can be timed out in order not to wait for them. Instead, the mechanism
will fall back to storing hints and/or not progressing with view
building.
In order to be able to iterate over view update write response handlers,
an intrusive list of them is added to storage proxy. This way
iteration can be easily yielded without invalidating operators and all
logic is moved to slow path.
Do not recalculate too much ranges in advance, it requires large
allocation and usually means that a consumer of the interface is going
to do to much work in parallel.
Fixes: #3767
get_restricted_ranges() function gets query provided key ranges
and divides them on vnode boundaries. It iterates over all ranges and
calculates all vnodes, but all its users are usually interested in only
one vnode since most likely it will be enough to populate a page. If it
will be not enough they will ask for more. This patch introduces new
interface instead of the function that allows to generate vnode ranges
on demand instead of precalculating all of them.