_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.
Tested: debug test.py and dtest auth_test.py
Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
Only do_stop_rpc_server uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-3-gleb@scylladb.com>
Only do_stop_native_transport() uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.
Message-Id: <20191128131217.26294-2-gleb@scylladb.com>
This patchset adds missing "const" function qualifiers throughout
the Scylla code base, which would make code less error-prone.
The changeset incorporates Kostja's work regarding const qualifiers
in the cql code hierarchy along with a follow-up patch addressing the
review comment of the corresponding patch set (the patch subject is
"cql: propagate const property through prepared statement tree.").
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
Before stopping the db itself, stop the migration service.
It must be stopped before RPC, but RPC is not stopped yet
itself, so we should be safe here.
Here's the tail of the resulting logs:
INFO 2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager
INFO 2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started
INFO 2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished
INFO 2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server
INFO 2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete.
Also -- stop the mm on drain before the commitlog it stopped.
[Tomasz: mm needs the cl because pulling schema changes from other nodes
involves applying them into the database. So cl/db needs to be
stopped after mm is stopped.]
The drain logs would look like
...
INFO 2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service
INFO 2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED:
and then on stop
...
INFO 2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started
INFO 2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished
INFO 2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server
INFO 2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete.
Fixes#5300
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191125080605.7661-1-xemul@scylladb.com>
Current LWT implementation uses at least three network round trips:
- first, execute PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to participating replicas
(there's also learn phase, but we don't wait for it to complete).
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.
Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:
latency 99% (ms) bandwidth (rq/s) timeouts (rq/s)
clients before after before after before after
1 2 2 626 637 0 0
5 4 3 2616 2843 0 0
10 3 3 4493 4767 0 0
50 7 7 10567 10833 0 0
100 15 15 12265 12934 0 0
200 48 30 13593 14317 0 0
400 185 60 14796 15549 0 0
600 290 94 14416 15669 0 0
800 568 118 14077 15820 2 0
1000 710 118 13088 15830 9 0
2000 1388 232 13342 15658 85 0
3000 1110 363 13282 15422 233 0
4000 1735 454 13387 15385 329 0
That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
invoke_on() guarantees that captures object won't be destroyed until the
future returned by the invoked function is resolved so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
"
This patch series adds only UDF support, UDA will be in the next patch series.
With this all CQL types are mapped to Lua. Right now we setup a new
lua state and copy the values for each argument and return. This will
be optimized once profiled.
We require --experimental to enable UDF in case there is some change
to the table format.
"
* 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits)
Lua: Document the conversions between Lua and CQL
Lua: Implement decimal subtraction
Lua: Implement decimal addition
Lua: Implement support for returning decimal
Lua: Implement decimal to string conversion
Lua: Implement decimal to floating point conversion
Lua: Implement support for decimal arguments
Lua: Implement support for returning varint
Lua: Implement support for returning duration
Lua: Implement support for duration arguments
Lua: Implement support for returning inet
Lua: Implement support for inet arguments
Lua: Implement support for returning time
Lua: Implement support for time arguments
Lua: Implement support for returning timeuuid
Lua: Implement support for returning uuid
Lua: Implement support for uuid and timeuuid arguments
Lua: Implement support for returning date
Lua: Implement support for date arguments
Lua: Implement support for returning timestamp
...
This patch resurrects Cassandra's code validating a consistency level
for CAS requests. Basically, it makes CAS requests use a special
function instead of validate_for_write to make error messages more
coherent.
Note, we don't need to resurrect requireNetworkTopologyStrategy as
EACH_QUORUM should work just fine for both CAS and non-CAS writes.
Looks like it is just an artefact of a rebase in the Cassandra
repository.
MV backpressure code frees mutation for delayed client replies earlier
to save memory. The commit 2d7c026d6e that
introduced the logic claimed to do it only when all replies are received,
but this is not the case. Fix the code to free only when all replies
are received for real.
Fixes#5242
Message-Id: <20191113142117.GA14484@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that ovewrote
rows previously present in a memtable.
Tests: unit(release)
This adds a requires_thread predicate to functions and propagates that
up until we get to code that already returns futures.
We can then use the predicate to decide if we need to use
seastar::async.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this it is possible to create user defined functions and
aggregates and they are saved to disk and the schema change is
propagated.
It is just not possible to call them yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This patch adds the following per table stats:
cas_prepare_latency
cas_propose_latency
cas_commit_latency
They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
This patch implements accounting of Cassandra's metrics related to
lightweight transactions, namely:
cas_read_latency transactional read latency (histogram)
cas_write_latency transactional write latency (histogram)
cas_read_timeouts number of transactional read timeouts
cas_write_timeouts number of transactional write timeouts
cas_read_unavailable number of transactional read
unavailable errors
cas_write_unavailable number of transactional write
unavailable errors
cas_read_unfinished_commit number of transaction commit attempts
that occurred on read
cas_write_unfinished_commit number of transaction commit attempts
that occurred on write
cas_write_condition_not_met number of transaction preconditions
that did not match current values
cas_read_contention how many contended reads were
encountered (histogram)
cas_write_contention how many contended writes were
encountered (histogram)
Pass contention by reference to begin_and_repair_paxos(), where it is
incremented on every sleep. Rationale: we want to account the total
number of times query() / cas() had to sleep, either directly or within
begin_and_repair_paxos(), no matter if the function failed or succeeded.
We may want to change paxos tables format and change internode protocol,
so hide lwt behind experimental flag for now.
Message-Id: <20191029102725.GM2866@scylladb.com>
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular rows matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
Introduce service::cas_request abstract base class
which can be used to parameterize Paxos logic.
Implement storage_proxy::cas() - compare and swap - the storage proxy
entry point for lightweight transactions.
Currently the code that manipulates mutations during write need to
check what kind of mutations are those and (sometimes) choose different
code paths. This patch encapsulates the differences in virtual
functions of mutation_holder object, so that high level code will not
concern itself with the details. The functions that are added:
apply_locally(), apply_remotely() and store_hint().
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
Paxos protocol relies on replicas having a state that persists over
crashes/restarts. This patch defines such state and stores it in the
database itself in the paxos table to make it persistent.
The stored state is:
in_progress_ballot - promised ballot
proposal - accepted value
proposal_ballot - the ballot of the accepted value
most_recent_commit - most recently learned value
most_recent_commit_at - the ballot of the most recently learned value
This patch add two data structures that will be used by paxos. First
one is "proposal" which contains a ballot and a mutation representing
a value paxos protocol is trying to set. Second one is
"prepare_response" which is a value returned by paxos prepare stage.
It contains currently accepted value (if any) and most recently
learned value (again if any). The later is used to "repair" replicas
that missed previous "learn" message.
Add a cluster feature for non-frozen UDTs.
If the cluster supports non-frozen UDTs, do not return an error
message when trying to create a table with a non-frozen user type.
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
The flag did nothing. It was used in one place to check if there's a
bug, but it can easily by proven by reading the code that the check
would never pass.
The storage_service::bootstrap method took a parameter: tokens to
bootstrap with. However, this method is only called in one place
(join_token_ring) with only one parameter: _bootstrap_tokens. It doesn't
make sense to call this method anywhere else with any other parameter.
This commit also adds a comment explaining what the method does and
moves it into the private section of storage_service.
When a non-seed node was bootstrapping, system_keyspace::update_tokens
was called twice: first right after the tokens were generated (or
received if we were replacing a different node) in the call to
`bootstrap`, and then later in join_token_ring. The second call was
redundant.
The join_token_ring call was also redundant if we were not bootstrapping
and had tokens saved previously (e.g. when restarting). In that case we
would have read them from LOCAL and then save the same tokens again.
This commit removes the redundant call and inserts calls to
update_tokens where they are necessary, when new tokens are generated.
The aim is to make the code easier to understand.
It also adds a comment which explains why the tokens don't need to be
generated in one of the cases.
After commit 36ccf72f3c, this method
was used only in one place.
Its name did not make it obvious what it does and when is it safe to call it.
This commit pulls out the code from set_tokens to the point where it was
called (join_token_ring). The code is only possible to understand in
context.
This code was also saving the tokens to the LOCAL table before
retrieving them from this table again. There is no point in doing that:
1. there are no races, since when join_token_ring is running, it is the
only function which can call system_keyspace::update_tokens (which saves them to the
LOCAL table). There can be no multiple instances of join_token_ring.
2. Even if there was a race, this wouldn't fix anything. The tokens we
retrieve from LOCAL by calling get_local_tokens().get0() could already
be different in the LOCAL table when the get0() returns.
Replace the two variables:
tokens_to_update_in_metadata
tokens_to_update_in_system_keyspace
which were exactly the same, with one variable owned_tokens.
The new name describes what the variable IS instead what's it used for.
Add a comment to clarify what "owned" means: those are the tokens the
node chose and any collision was resolved positively for this node.
Move the variable definition further down in the code, where it's
actually needed.
We would like to share with other nodes
the value of ignore_msb_bits property used by the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know ignore_msb_bits property value for each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the number of shards available at the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know how many shards are on each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When another node is reported to be down, view updates queued
for it are cancelled, but some of them may already be initiated.
Right now, cancelling such a write resulted in an exception,
but on conceptual level it's not really an exception, since
this behaviour is expected.
Previous version of this patch was based on introducing a special
exception type that was later handled specially, but it's not clear
if it's a good direction. Instead, this patch simply makes this
path non-exceptional, as was originally done by Nadav in the first
version of the series that introduced handling unstarted write
cancellations. Additionally, a message containing the information
that a write is cancelled is logged with debug level.