Commit Graph

1519 Commits

Author SHA1 Message Date
Gleb Natapov
75499896ab client_state: store _user as optional instead of shared_ptr
_user cannot outlive client_state class instance, so there is no point
in holding it in shared_ptr.

Tested: debug test.py and dtest auth_test.py

Message-Id: <20191128131217.26294-5-gleb@scylladb.com>
2019-11-28 15:48:59 +02:00
Gleb Natapov
ce5d6d5eee storage_service: store thrift server as an optional instead of shared_ptr
Only do_stop_rpc_server uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.

Message-Id: <20191128131217.26294-3-gleb@scylladb.com>
2019-11-28 15:48:51 +02:00
Gleb Natapov
b9b99431a8 storage_service: store cql server as an optional instead of shared_ptr
Only do_stop_native_transport() uses the shared_ptr to prolong server's
lifetime until stop() completes, but do_with() can be used to achieve the
same.

Message-Id: <20191128131217.26294-2-gleb@scylladb.com>
2019-11-28 15:48:47 +02:00
Pavel Emelyanov
8532093c61 cql: The cql_server does not need proxy reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191127153842.4098-1-xemul@scylladb.com>
2019-11-28 10:58:46 +01:00
Tomasz Grabiec
87b72dad3e Merge "treewide: add missing const qualifiers" from Pavel Solodovnikov
This patchset adds missing "const" function qualifiers throughout
the Scylla code base, which would make code less error-prone.

The changeset incorporates Kostja's work regarding const qualifiers
in the cql code hierarchy along with a follow-up patch addressing the
review comment of the corresponding patch set (the patch subject is
"cql: propagate const property through prepared statement tree.").
2019-11-27 10:56:20 +01:00
Piotr Sarna
9c5a5a5ac2 treewide: add names to semaphores
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:

  auto sem = semaphore(0);

will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:

  auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});

will present a message with at least some debugging context:

  "Semaphore broken: io_concurrency_sem"

It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.

At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.

Refs #4999

Tests: unit(dev), manual: hardcoding a failure in view building code
2019-11-26 15:14:21 +02:00
Pavel Solodovnikov
2f442f28af treewide: add const qualifiers throughout the code base 2019-11-26 02:24:49 +03:00
Pavel Emelyanov
f6ac969f1e mm: Stop migration manager
Before stopping the db itself, stop the migration service.
It must be stopped before RPC, but RPC is not stopped yet
itself, so we should be safe here.

Here's the tail of the resulting logs:

INFO  2019-11-20 11:22:35,193 [shard 0] init - shutdown migration manager
INFO  2019-11-20 11:22:35,193 [shard 0] migration_manager - stopping migration service
INFO  2019-11-20 11:22:35,193 [shard 1] migration_manager - stopping migration service
INFO  2019-11-20 11:22:35,193 [shard 0] init - Shutdown database started
INFO  2019-11-20 11:22:35,193 [shard 0] init - Shutdown database finished
INFO  2019-11-20 11:22:35,193 [shard 0] init - stopping prometheus API server
INFO  2019-11-20 11:22:35,193 [shard 0] init - Scylla version 666.development-0.20191120.25820980f shutdown complete.

Also -- stop the mm on drain before the commitlog it stopped.
[Tomasz: mm needs the cl because pulling schema changes from other nodes
involves applying them into the database. So cl/db needs to be
stopped after mm is stopped.]

The drain logs would look like

...
INFO  2019-11-25 11:00:40,562 [shard 0] migration_manager - stopping migration service
INFO  2019-11-25 11:00:40,562 [shard 1] migration_manager - stopping migration service
INFO  2019-11-25 11:00:40,563 [shard 0] storage_service - DRAINED:

and then on stop

...
INFO  2019-11-25 11:00:46,427 [shard 0] init - shutdown migration manager
INFO  2019-11-25 11:00:46,427 [shard 0] init - Shutdown database started
INFO  2019-11-25 11:00:46,427 [shard 0] init - Shutdown database finished
INFO  2019-11-25 11:00:46,427 [shard 0] init - stopping prometheus API server
INFO  2019-11-25 11:00:46,427 [shard 0] init - Scylla version 666.development-0.20191125.3eab6cd54 shutdown complete.

Fixes #5300

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20191125080605.7661-1-xemul@scylladb.com>
2019-11-25 12:59:01 +01:00
Vladimir Davydov
bf5f864d80 paxos: piggyback result query on prepare response
Current LWT implementation uses at least three network round trips:
 - first, execute PAXOS prepare phase
 - second, query the current value of the updated key
 - third, propose the change to participating replicas

(there's also learn phase, but we don't wait for it to complete).

The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.

To generate less network traffic, only the closest to the coordinator
replica sends data while other participating replicas send digests which
are used to check data consistency.

Note, this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature in the early development
stage and marked experimental.

To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients each of
which updates its own key (uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1) and
measures the aggregate bandwidth and latency. The test uses shard-aware
gocql driver. Here are the results:

                latency 99% (ms)    bandwidth (rq/s)    timeouts (rq/s)
    clients     before  after       before  after       before  after
          1          2      2          626    637            0      0
          5          4      3         2616   2843            0      0
         10          3      3         4493   4767            0      0
         50          7      7        10567  10833            0      0
        100         15     15        12265  12934            0      0
        200         48     30        13593  14317            0      0
        400        185     60        14796  15549            0      0
        600        290     94        14416  15669            0      0
        800        568    118        14077  15820            2      0
       1000        710    118        13088  15830            9      0
       2000       1388    232        13342  15658           85      0
       3000       1110    363        13282  15422          233      0
       4000       1735    454        13387  15385          329      0

That is, this optimization improves max LWT bandwidth by about 15%
and allows to run 3-4x more clients while maintaining the same level
of system responsiveness.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
3d1d4b018f paxos: remove unnecessary move constructor invocations
invoke_on() guarantees that captures object won't be destroyed until the
future returned by the invoked function is resolved so there's no need
to move key, token, proposal for calling paxos_state::*_impl helpers.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
ef2e96c47c storage_proxy: factor out helper to sort endpoints by proximity
We need it for PAXOS.
2019-11-24 11:35:29 +02:00
Vladimir Davydov
63d4590336 storage_proxy: move digest_algorithm upper
We need it for PAXOS.

Mark it as static inline while we are at it.
2019-11-24 11:35:29 +02:00
Avi Kivity
1fe062aed4 Merge "Add basic UDF support" from Rafael
"

This patch series adds only UDF support, UDA will be in the next patch series.

With this all CQL types are mapped to Lua. Right now we setup a new
lua state and copy the values for each argument and return. This will
be optimized once profiled.

We require --experimental to enable UDF in case there is some change
to the table format.
"

* 'espindola/udf-only-v4' of https://github.com/espindola/scylla: (65 commits)
  Lua: Document the conversions between Lua and CQL
  Lua: Implement decimal subtraction
  Lua: Implement decimal addition
  Lua: Implement support for returning decimal
  Lua: Implement decimal to string conversion
  Lua: Implement decimal to floating point conversion
  Lua: Implement support for decimal arguments
  Lua: Implement support for returning varint
  Lua: Implement support for returning duration
  Lua: Implement support for duration arguments
  Lua: Implement support for returning inet
  Lua: Implement support for inet arguments
  Lua: Implement support for returning time
  Lua: Implement support for time arguments
  Lua: Implement support for returning timeuuid
  Lua: Implement support for returning uuid
  Lua: Implement support for uuid and timeuuid arguments
  Lua: Implement support for returning date
  Lua: Implement support for date arguments
  Lua: Implement support for returning timestamp
  ...
2019-11-17 16:38:19 +02:00
Vladimir Davydov
25aeefd6f3 cql: fix CAS consistency level validation
This patch resurrects Cassandra's code validating a consistency level
for CAS requests. Basically, it makes CAS requests use a special
function instead of validate_for_write to make error messages more
coherent.

Note, we don't need to resurrect requireNetworkTopologyStrategy as
EACH_QUORUM should work just fine for both CAS and non-CAS writes.
Looks like it is just an artefact of a rebase in the Cassandra
repository.
2019-11-14 12:15:39 +01:00
Gleb Natapov
552c56633e storage_proxy: do not release mutation if not all replies were received
MV backpressure code frees mutation for delayed client replies earlier
to save memory. The commit 2d7c026d6e that
introduced the logic claimed to do it only when all replies are received,
but this is not the case. Fix the code to free only when all replies
are received for real.

Fixes #5242

Message-Id: <20191113142117.GA14484@scylladb.com>
2019-11-13 16:23:19 +02:00
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Rafael Ávila de Espíndola
d9337152f3 Use threads when executing user functions
This adds a requires_thread predicate to functions and propagates that
up until we get to code that already returns futures.

We can then use the predicate to decide if we need to use
seastar::async.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
fc72a64c67 Add schema propagation and storage for UDF
With this it is possible to create user defined functions and
aggregates and they are saved to disk and the schema change is
propagated.

It is just not possible to call them yet.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:41:08 -08:00
Rafael Ávila de Espíndola
ce6304d920 UDF: Add a feature and config option to track if udf is enabled
It can only be enabled with --experimental.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-11-07 08:40:47 -08:00
Vladimir Davydov
b75862610e paxos_state: account paxos round latency
This patch adds the following per table stats:

  cas_prepare_latency
  cas_propose_latency
  cas_commit_latency

They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
2019-10-29 19:26:18 +03:00
Vladimir Davydov
c27ab87410 storage_proxy: add cas request accounting
This patch implements accounting of Cassandra's metrics related to
lightweight transactions, namely:

  cas_read_latency              transactional read latency (histogram)
  cas_write_latency             transactional write latency (histogram)
  cas_read_timeouts             number of transactional read timeouts
  cas_write_timeouts            number of transactional write timeouts
  cas_read_unavailable          number of transactional read
                                unavailable errors
  cas_write_unavailable         number of transactional write
                                unavailable errors
  cas_read_unfinished_commit    number of transaction commit attempts
                                that occurred on read
  cas_write_unfinished_commit   number of transaction commit attempts
                                that occurred on write
  cas_write_condition_not_met   number of transaction preconditions
                                that did not match current values
  cas_read_contention           how many contended reads were
                                encountered (histogram)
  cas_write_contention          how many contended writes were
                                encountered (histogram)
2019-10-29 19:25:47 +03:00
Vladimir Davydov
967a9e3967 storage_proxy: zap ballot_and_contention
Pass contention by reference to begin_and_repair_paxos(), where it is
incremented on every sleep. Rationale: we want to account the total
number of times query() / cas() had to sleep, either directly or within
begin_and_repair_paxos(), no matter if the function failed or succeeded.
2019-10-29 19:22:18 +03:00
Gleb Natapov
0e9df4eaf8 lwt: mark lwt as experimental
We may want to change paxos tables format and change internode protocol,
so hide lwt behind experimental flag for now.

Message-Id: <20191029102725.GM2866@scylladb.com>
2019-10-29 14:33:48 +02:00
Gleb Natapov
e5e44bfda2 client_state: fix get_timestamp_for_paxos() to always advance a timestamp
Message-Id: <20191029102336.GL2866@scylladb.com>
2019-10-29 13:07:33 +02:00
Vladimir Davydov
e0b31dd273 query: add flag to return static row on partition with no rows
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular rows matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
2019-10-28 21:50:44 +03:00
Konstantin Osipov
203eb3eccc lwt: sleep a random amount of time when retrying CAS
Sleep a random interval between 0 and 100 ms before retrying CAS.
Reuse sleep function, make the distribution object thread local.
2019-10-27 23:42:03 +03:00
Konstantin Osipov
0674fab05c lwt: implement storage_proxy::cas()
Introduce service::cas_request abstract base class
which can be used to parameterize Paxos logic.

Implement storage_proxy::cas() - compare and swap - the storage proxy
entry point for lightweight transactions.
2019-10-27 23:42:03 +03:00
Gleb Natapov
70adf65341 storage_proxy: make mutation holder responsible for mutation operation
Currently the code that manipulates mutations during write need to
check what kind of mutations are those and (sometimes) choose different
code paths. This patch encapsulates the differences in virtual
functions of mutation_holder object, so that high level code will not
concern itself with the details. The functions that are added:
apply_locally(), apply_remotely() and store_hint().
2019-10-27 23:21:51 +03:00
Gleb Natapov
b3e01a45d7 lwt: storage_proxy: implement paxos protocol
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
2019-10-27 23:21:51 +03:00
Gleb Natapov
d1774693bf lwt: Define state needed by paxos and persist it
Paxos protocol relies on replicas having a state that persists over
crashes/restarts. This patch defines such state and stores it in the
database itself in the paxos table to make it persistent.

The stored state is:
  in_progress_ballot    - promised ballot
  proposal              - accepted value
  proposal_ballot       - the ballot of the accepted value
  most_recent_commit    - most recently learned value
  most_recent_commit_at - the ballot of the most recently learned value
2019-10-27 23:21:51 +03:00
Gleb Natapov
15b935b95d lwt: add data structures needed for paxos implementation
This patch add two data structures that will be used by paxos. First
one is "proposal" which contains a ballot and a mutation representing
a value paxos protocol is trying to set. Second one is
"prepare_response" which is a value returned by paxos prepare stage.
It contains currently accepted value (if any) and most recently
learned value (again if any). The later is used to "repair" replicas
that missed previous "learn" message.
2019-10-27 23:21:51 +03:00
Kamil Braun
e74b5deb5d cql3: enable non-frozen UDTs.
Add a cluster feature for non-frozen UDTs.

If the cluster supports non-frozen UDTs, do not return an error
message when trying to create a table with a non-frozen user type.
2019-10-25 12:04:44 +02:00
Asias He
f876580740 storage_service: Reject nodetool cleanup when there is pending ranges
From Shlomi:

4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)

I get the error on c-s once decommission ends
  java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated

The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.

To fix, we should reject the nodetool cleanup when there is pending ranges on that node.

Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.

Refs: #5045
2019-10-23 19:20:36 +08:00
Asias He
a39c8d0ed0 Revert "storage_service: remove storage_service::_is_bootstrap_mode."
It will be needed by "storage_service: Reject nodetool cleanup when
there is pending ranges"

This reverts commit dbca327b46.
2019-10-23 19:20:36 +08:00
Kamil Braun
f1c26bf5c9 storage_service: more comments in join_token_ring
Explain why a call to update_normal_tokens is needed.
2019-10-21 11:11:03 +02:00
Kamil Braun
dbca327b46 storage_service: remove storage_service::_is_bootstrap_mode.
The flag did nothing. It was used in one place to check if there's a
bug, but it can easily by proven by reading the code that the check
would never pass.
2019-10-21 11:11:03 +02:00
Kamil Braun
b757a19f84 storage_service: simplify storage_service::bootstrap method
The storage_service::bootstrap method took a parameter: tokens to
bootstrap with. However, this method is only called in one place
(join_token_ring) with only one parameter: _bootstrap_tokens. It doesn't
make sense to call this method anywhere else with any other parameter.

This commit also adds a comment explaining what the method does and
moves it into the private section of storage_service.
2019-10-21 11:11:03 +02:00
Kamil Braun
84b41bd89b storage_service: fix typo in handle_state_moving 2019-10-21 11:11:03 +02:00
Kamil Braun
2ff4f9b8f4 storage_service: remove unnecessary use of stringstream 2019-10-21 11:11:03 +02:00
Kamil Braun
06cc7d409d storage_service: remove redundant call to update_tokens during join_token_ring
When a non-seed node was bootstrapping, system_keyspace::update_tokens
was called twice: first right after the tokens were generated (or
received if we were replacing a different node) in the call to
`bootstrap`, and then later in join_token_ring. The second call was
redundant.

The join_token_ring call was also redundant if we were not bootstrapping
and had tokens saved previously (e.g. when restarting). In that case we
would have read them from LOCAL and then save the same tokens again.

This commit removes the redundant call and inserts calls to
update_tokens where they are necessary, when new tokens are generated.
The aim is to make the code easier to understand.

It also adds a comment which explains why the tokens don't need to be
generated in one of the cases.
2019-10-21 11:11:03 +02:00
Kamil Braun
a223864f81 storage_service: remove storage_service::set_tokens method.
After commit 36ccf72f3c, this method
was used only in one place.
Its name did not make it obvious what it does and when is it safe to call it.
This commit pulls out the code from set_tokens to the point where it was
called (join_token_ring). The code is only possible to understand in
context.

This code was also saving the tokens to the LOCAL table before
retrieving them from this table again. There is no point in doing that:
1. there are no races, since when join_token_ring is running, it is the
only function which can call system_keyspace::update_tokens (which saves them to the
LOCAL table). There can be no multiple instances of join_token_ring.
2. Even if there was a race, this wouldn't fix anything. The tokens we
retrieve from LOCAL by calling get_local_tokens().get0() could already
be different in the LOCAL table when the get0() returns.
2019-10-21 11:09:59 +02:00
Kamil Braun
36ccf72f3c storage_service: remove is_survey_mode
That was dead, untested code, making it unnecessarily hard
to implement new features.
2019-10-21 10:38:49 +02:00
Kamil Braun
602c7268cc storage_service::handle_state_normal: tokens_to_update* -> owned_tokens
Replace the two variables:
    tokens_to_update_in_metadata
    tokens_to_update_in_system_keyspace
which were exactly the same, with one variable owned_tokens.
The new name describes what the variable IS instead what's it used for.

Add a comment to clarify what "owned" means: those are the tokens the
node chose and any collision was resolved positively for this node.

Move the variable definition further down in the code, where it's
actually needed.
2019-10-21 10:38:49 +02:00
Kamil Braun
2db07c697f storage_service::handle_state_normal: remove local_tokens_to_remove
That was dead code.
Removing tokens is handled inside remove_endpoint, using the
endpoints_to_remove set.
2019-10-21 10:38:49 +02:00
Piotr Jastrzebski
afe520ad77 gossip: Add application_state::IGNORE_MSB_BITS
We would like to share with other nodes
the value of ignore_msb_bits property used by the node.

This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.

To be able to generate such stream_id we need
to know ignore_msb_bits property value for each node.

IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-10-17 10:55:31 +02:00
Piotr Jastrzebski
a66d7cfe57 gossip: Add application_state::SHARD_COUNT
We would like to share with other nodes
the number of shards available at the node.

This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.

To be able to generate such stream_id we need
to know how many shards are on each node.

IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-10-17 10:55:31 +02:00
Piotr Jastrzebski
f7ce8e4f2b cdc: Add flag guarding it's usage
At first, CDC will only be enabled when experimental flag is on.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-10-17 10:55:31 +02:00
Piotr Sarna
36a1905e98 storage_proxy: handle unstarted write cancelling
When another node is reported to be down, view updates queued
for it are cancelled, but some of them may already be initiated.
Right now, cancelling such a write resulted in an exception,
but on conceptual level it's not really an exception, since
this behaviour is expected.
Previous version of this patch was based on introducing a special
exception type that was later handled specially, but it's not clear
if it's a good direction. Instead, this patch simply makes this
path non-exceptional, as was originally done by Nadav in the first
version of the series that introduced handling unstarted write
cancellations. Additionally, a message containing the information
that a write is cancelled is logged with debug level.
2019-10-07 16:55:36 +03:00
Avi Kivity
162730862d storage_proxy: remove variadic future from query_partition_key_range_concurrent()
Seastar variadic futures are deprecated, so replace with a nice struct.
2019-09-30 21:33:44 +03:00
Avi Kivity
968b34a2b4 storage_proxy: remove variadic future from digest_read_resolver
Seastar variadic futures are deprecated, so replace with a nice
struct.
2019-09-30 21:32:17 +03:00