Commit Graph

21980 Commits

Piotr Jastrzebski
1a43849cd2 table: Add cache_enabled member function
This function determines cache usage based on both the table's _config
and dynamic schema information.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:39:01 +02:00
Piotr Jastrzebski
546dbf1fcc cf_prop_defs: persist caching_options in schema
Previously 'WITH CACHING =' was ignored both in
CREATE TABLE and in ALTER TABLE statements.
Now it will be persisted in schema so that
it can be used later to control caching per table.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:38:37 +02:00
Piotr Jastrzebski
812dfd22bd property_definitions: add get that returns variant
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:38:04 +02:00
Piotr Jastrzebski
0475dab359 feature: add PER_TABLE_CACHING feature
This feature will ensure that caching can be switched
off per table only after the whole cluster supports it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-05 08:14:49 +02:00
Piotr Jastrzebski
2d727114ed caching_options: add enabled parameter
Scylla inherits from Origin two caching parameters
(keys and rows_per_partition) that are ignored.

This patch adds a new parameter called "enabled"
which is true by default and controls whether cache
is used for a selected table or not.

If the parameter is missing in the map then it
has the default value of true. To minimize the impact
of this change, enabled == true is represented as an
absence of this parameter.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-05 08:14:49 +02:00
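The default-is-absence convention described above can be sketched as a small model (helper names are hypothetical, not Scylla's actual code):

```python
# Hypothetical sketch of the "enabled" caching parameter semantics:
# a missing key defaults to true, and enabled == true is represented
# by the absence of the key, minimizing the impact on the schema.

def is_cache_enabled(caching_options):
    # Absence of the 'enabled' key means the cache is on.
    return caching_options.get("enabled", "true") == "true"

def normalize(caching_options):
    # Drop 'enabled' when it holds the default value of true.
    out = dict(caching_options)
    if is_cache_enabled(out):
        out.pop("enabled", None)
    return out
```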
Piotr Sarna
1c4e8f5030 alternator: fix checking max item depth
Maximum item depth accepted by DynamoDB is 32, and alternator
chose 39 as its arbitrary value in order to provide 7 shining
new levels absolutely free of charge. Unfortunately, our code
which checks the nesting level in rapidjson parsing bumps
the counter by 2 for every object, which is due to rapidjson's
internal implementation. In order to actually support
at least 32 levels, the threshold is simply doubled.
This commit comes with a test case which ensures that
32-nested items are accepted both by alternator and DynamoDB.
The test case failed for alternator before the fix.

Fixes #6366
Tests: unit(dev), alternator(local, remote)
2020-05-04 23:46:20 +03:00
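The counting quirk can be modeled in a few lines (illustrative only; the exact comparison operator used by the real parser is an assumption):

```python
# Model of the depth-check bug: rapidjson bumps the nesting counter by 2
# per object, so a threshold of 39 rejects 32-nested items (counter 64).
# Doubling the threshold to 78 makes 32 levels fit again.

OLD_THRESHOLD = 39
NEW_THRESHOLD = 2 * OLD_THRESHOLD

def accepted(nesting_levels, threshold):
    counter = nesting_levels * 2  # two bumps per nested object
    return counter <= threshold
```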
Glauber Costa
c5cdd77f8e gossip_test: start the compaction manager explicitly
Right now the compaction_manager needs to be started explicitly.
We may change it in the future, but right now that's how it is.

Everything works now even without it, because compaction_manager::stop
happens to work even if it was not started. But it is technically
illegal.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200504143048.17201-1-glauber@scylladb.com>
2020-05-04 17:40:32 +03:00
Bentsi Magidovich
e77dad3adf scylla_coredump_setup: Fix incorrect coredump directory mount
The issue is that the mount is /var/lib/scylla/coredump ->
/var/lib/systemd/coredump. But we need to do the opposite in order to
save the coredump on the partition that Scylla is using:
/var/lib/systemd/coredump-> /var/lib/scylla/coredump

Fixes #6301
2020-05-04 15:47:45 +03:00
Avi Kivity
f3bcd4d205 Merge 'Support SSL Certificate Hot Reloading' from Calle
"
Fixes #6067

Makes the scylla endpoint initializations that support TLS use reloadable certificate stores, watching the used cert + key files for changes and reloading iff modified.

Tests in separate dtest set.
"

* elcallio-calle/reloadable-tls:
  transport: Use reloadable tls certificates
  redis: Use reloadable tls certificates
  alternator: Use reloadable tls certificates
  messaging_service: Use reloadable TLS certificates
2020-05-04 15:11:16 +03:00
Piotr Sarna
bec95a0605 treewide: use thread-safe variant of localtime
In order to ensure thread-safety, all usages of localtime()
are replaced with localtime_r(), which writes its result into a
caller-provided buffer.

Tests: unit(dev)
Fixes #6364
Message-Id: <ad4a0c0e1707f0318325718715a3a647e3ebfdfe.1588592156.git.sarna@scylladb.com>
2020-05-04 14:46:08 +03:00
Calle Wilund
70aca26a3e transport: Use reloadable tls certificates 2020-05-04 11:32:21 +00:00
Calle Wilund
bacf2fa981 redis: Use reloadable tls certificates 2020-05-04 11:32:21 +00:00
Calle Wilund
cc9bb6454c alternator: Use reloadable tls certificates 2020-05-04 11:32:21 +00:00
Calle Wilund
08d069f78d messaging_service: Use reloadable TLS certificates
Changes messaging service rpc to use reloadable tls
certificates iff tls is enabled.

Note that this means that the service cannot start
listening at construction time if TLS is active,
and users need to call start_listen_ex to initialize
and actually start the service.

Since the "normal" messaging service is actually started
from gms, this route too is made a continuation.
2020-05-04 11:32:21 +00:00
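The deferred-start behavior can be sketched as follows (class and flag names are hypothetical, not the actual messaging_service API):

```python
# Sketch: with TLS active the service cannot start listening at
# construction time; the caller must invoke start_listen_ex(), which
# would initialize the reloadable certificates and begin listening.

class SketchMessagingService:
    def __init__(self, tls_enabled):
        self.tls_enabled = tls_enabled
        # Without TLS, listening can begin immediately at construction.
        self.listening = not tls_enabled

    def start_listen_ex(self):
        # With TLS, certificates are (re)loaded here before listening.
        self.listening = True
```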
Piotr Sarna
fb7fa7f442 alternator: fix signature timestamps
Generating timestamps for auth signatures used a non-thread-safe
::gmtime function instead of thread-safe ::gmtime_r.

Tests: unit(dev)
Fixes #6345
2020-05-04 14:12:11 +03:00
Piotr Sarna
05ec95134a clocks-impl: switch to thread-safe time conversion
std::gmtime() has a sad property of using a global static buffer
for returning its value. This is not thread-safe, so its usage
is replaced with gmtime_r, which accepts a caller-provided buffer.
While no regressions were observed in this particular area of code,
a similar bug caused failures in alternator, so it's better to simply
replace all std::gmtime calls with their thread-safe counterpart.

Message-Id: <39e91c74de95f8313e6bb0b12114bf12c0e79519.1588589151.git.sarna@scylladb.com>
2020-05-04 14:11:38 +03:00
Takuya ASADA
57f3f82ed1 redis: add EX option for set command
Add the EX option to the SET command, to set a TTL for the key.
The behavior of SET with EX is the same as the SETEX command; only the syntax differs.

see: https://redis.io/commands/set
2020-05-04 13:58:18 +03:00
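A toy key-value store illustrates the equivalence (this is a simplified model, not Scylla's Redis implementation):

```python
import time

class TinyStore:
    """Toy store where SET ... EX and SETEX have identical semantics."""

    def __init__(self):
        self._data = {}  # key -> (value, absolute expiry or None)

    def set(self, key, value, ex=None):
        expiry = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expiry)

    def setex(self, key, seconds, value):
        # Same behavior as SET with EX; only the argument order differs.
        self.set(key, value, ex=seconds)

    def ttl(self, key):
        _, expiry = self._data[key]
        return None if expiry is None else expiry - time.monotonic()
```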
Eliran Sinvani
a346e862c1 Auth: return correct error code when role is not found
Scylla returns the wrong error code (0000 - server internal error)
in response to authentication/authorization operations
that involve a non-existent role.
This commit changes those cases to return error code 2200 (invalid
query), which is the correct one and also the one that Cassandra
returns.
Tests:
    Unit tests (Dev)
    All auth and auth_role dtests
2020-05-04 12:57:27 +03:00
Glauber Costa
55f5ca39a9 sstable_test: rework test to use a thread
The compaction_manager test lives inside a thread but is not taking
advantage of it, with continuations all over.

One of the side effects of this is that the test calls stop() twice
on the compaction_manager. While this works today, it is not good
practice, and a change I am making is just about to break it.

This patch converts the test to fully use .get() instead of chained
continuations and in doing so also guarantees that the compaction
manager will be RAII-stopped just once, from a defer object.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200503161420.8346-2-glauber@scylladb.com>
2020-05-03 19:54:04 +03:00
Piotr Sarna
bf5f247bc5 db: set gc grace period to 0 for local system tables
Local system tables from `system` namespace use LocalStrategy
replication, so they do not need to be concerned about gc grace
period. Some system tables already set gc grace period to 0,
but other ones, including system.large_partitions, did not.
That may result in millions of tombstones being needlessly
kept for these tables, which can cause read timeouts.

Fixes #6325
Tests: unit(dev), local(running cqlsh and playing with system tables)
2020-05-03 17:41:50 +03:00
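The effect of a zero grace period can be illustrated with the usual tombstone-purge rule (a simplified model of compaction-time garbage collection, not Scylla's actual code):

```python
def tombstone_purgeable(deletion_time, gc_grace_seconds, now):
    # A tombstone may be dropped at compaction once its grace period
    # has elapsed; with gc_grace_seconds == 0, as set for local system
    # tables above, it is purgeable immediately.
    return now >= deletion_time + gc_grace_seconds
```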
Avi Kivity
9952cdfec1 Merge "scylla-gdb.py: improve finding references to intrusive container elements" from Botond
"
Intrusive containers often have references between containers elements
that point to some non-first word of the element. This references
currently fly below the radar of `scylla find` and `scylla
generate-object-graph`, as they are looking to references to only the
first word of the objects. So objects that are members of an intrusive
container often appear to have no inbound references at all.

This patch-set improves support for finding such references by looking
for references to non-first words of objects.

It also includes some generic, minor improvements to scylla
generate_object_graph.
"

* 'scylla-gdb.py-scylla-generate-object-graph-linked-lists/v1' of https://github.com/denesb/scylla:
  scylla-gdb.py: scylla generate_object_graph: make label of initial vertice bold
  scylla-gdb.py: scylla generate_object_graph: remove redundant lookup
  scylla-gdb.py: scylla generate_object_graph: print "to" offsets
  scylla-gdb.py: scylla generate-object-graph: use value-range to find references
  scylla-gdb.py: scylla find: allow finding ranges of values
  scylla-gdb.py: find_in_live(): return pointer_metadata instances
2020-05-03 16:22:22 +03:00
Glauber Costa
70e5252a5d table: no longer accept online loading of SSTable files in the main directory
Loading SSTables from the main directory is possible, to be compatible with
Cassandra, but extremely dangerous and not recommended.

From the beginning, we have recommended using a separate upload/ directory.
In all this time, perhaps because the feature's usefulness is reduced
in Cassandra by the possible races, I have never seen anyone coming
from Cassandra doing procedures involving refresh at all.

Loading SSTables from the main directory forces us to disable writes to
the table temporarily until the SSTables are sorted out. If we get rid of
this, we can get rid of the disabling of the writes as well.

We can't do that yet: to be nice to the odd user who may be using
refresh through the main directory without our knowledge, we should
at least error out.

This patch, then, does that: it errors out if SSTables are found in the main
directory. It will not proceed with the refresh, and it directs the user to the
upload directory.

The main loop in reshuffle_sstables is left in place structurally for now, but
most of it is gone. The test for it is deleted.

After a period of deprecation we can start ignoring these SSTables and get rid
of the lock.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200429144511.13681-1-glauber@scylladb.com>
2020-05-03 08:40:38 +03:00
Glauber Costa
e44b2826ab compaction: avoid abandoned futures when using interposers
When using interposers, cancelling compactions can leave futures
that are not waited for (resharding, twcs)

The reason is that when consume_end_of_stream gets called, it tries to
push end_of_stream into the queue_reader_handle. Because cancelling
a compaction is done through an exception, the queue_reader_handle
has already been terminated at this point. Trying to push to it generates
another exception and prevents us from returning the future right
below it.

This patch adds a new method, is_terminated(), and if we detect
that the queue_reader_handle has already been terminated by this point,
we don't try to push. The method is called is_terminated() because the
check is whether the queue_reader_handle still has a _reader; the reader
is also set to null on successful destruction.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200430175839.8292-1-glauber@scylladb.com>
2020-05-01 16:30:23 +03:00
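The guard can be sketched like this (a simplified model; names follow the commit message, but the structure is illustrative):

```python
class QueueReaderHandle:
    """Toy model: the handle is terminated when it no longer holds a _reader."""

    def __init__(self):
        self._reader = object()
        self.pushed = []

    def terminate(self):
        # Cancellation via an exception (or successful destruction)
        # nulls the reader.
        self._reader = None

    def is_terminated(self):
        return self._reader is None

    def push_end_of_stream(self):
        self.pushed.append("end_of_stream")

def consume_end_of_stream(handle):
    # The fix: don't push into an already-terminated handle, which would
    # raise again and leave the returned future abandoned.
    if not handle.is_terminated():
        handle.push_end_of_stream()
```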
Avi Kivity
122f57871d Update seastar submodule
* seastar 0523b0fac...3c2e27811 (2):
  > future: Add a futurizer::satisfy_with_result_of
  > future: Move concept definitions earlier
2020-05-01 12:55:48 +03:00
Tomasz Grabiec
d78fbf7c16 Merge "storage_service: Make replacing node take writes" from Asias
Background:

Replace operation is used to replace a dead node in the cluster.
Currently, during a replace operation the replacing node does not take
any writes. As a result, new writes to a range after the sync for that
range is done, e.g., after streaming for that range is finished, will not
be synced to the replacing node. Hinted handoff or repair after the
replace operation will help, but it is better to make the writes reach
the replacing node and avoid any post-replace actions.

After this series and the repair-based node operation series, the replace
operation will guarantee that the replacing node has the latest copy of
the data, including the new writes made during the replace operation. In
short, no more repairs before or after the replace operation; just
replacing the node is enough.

Implementation:

1) Filter the node being replaced out of the natural endpoints in
   storage_proxy, so that:

- The node being replaced will not be selected as the target for
  normal writes or normal reads.

- We do not depend on gossip liveness to avoid selecting the replacing
  node for normal writes or reads when the replacing node has the same
  IP address as the node being replaced. There is no more special
  handling for the hibernate state in gossip, which makes it simpler and
  more robust. The replacing node will be marked as UP.

2) Put the replacing node in the pending list, so that:

- The replacing node will take writes, but writes to it will not be
  counted towards CL.

- The replacing node will not take normal reads.

Example:

For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and
being replaced by node n4. When n4 starts:

writes to nodes {n1, n2, n3} are changed to
normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}.

reads to nodes {n1, n2, n3} are changed to
normal_replica_reads = {n1, n2} only.

This way, the replacing node n4 now takes writes but does not take reads.

Tests:

Measure the number of writes during the pending period, i.e., between
when the replacing node starts and finishes the replace operation.

- Start 5 nodes, n1 to n5.
- Stop n5.
- Start writes in the background.
- Start n6 to replace n5.
- Get the scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status.
Before:
2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483}
2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857}
2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195}
2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764}
2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331}
2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088}
2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352}
2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004}
2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972}
2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513}

Writes = 513 - 331

After:
2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886}
2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304}
2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049}
2020-02-06 08:28:56.601464 scylla_database_total_writes: node4={'scylla_database_total_writes': 303770}
2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604}
2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126}
2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549}
2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416}
2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580}
2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460}

Writes = 45460 - 604

As we can see, the replacing node did not take writes before the patch and takes writes after it.

Check the log of the write handler in storage_proxy
storage_proxy - creating write handler for token: -2642068240672386521,
keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2},
natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6}

The node being replaced, n5=127.0.0.5, is filtered out and the replacing
node, n6=127.0.0.6 is in the pending list.

* asias/replace_take_writes:
  storage_service: Make replacing node take writes
  repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair
  abstract_replication_strategy: Add get_ranges which takes token_metadata
  abstract_replication_strategy: Add get_natural_endpoints_without_node_being_replaced
  abstract_replication_strategy: Add allow_remove_node_being_replaced_from_natural_endpoints
  token_metadata: Calculate pending ranges for replacing node
  storage_service: Unify handling of replaced node removal from gossip
  storage_service: Update tokens and replace address for replace operation
2020-04-30 19:28:35 +02:00
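The endpoint selection described above can be modeled in a few lines (function and field names are illustrative, not storage_proxy's actual API):

```python
def plan_requests(natural_endpoints, node_being_replaced, replacing_node):
    # Filter the node being replaced out of the natural endpoints.
    normal = [n for n in natural_endpoints if n != node_being_replaced]
    return {
        # The pending (replacing) node takes writes...
        "write_targets": normal + [replacing_node],
        # ...but writes to it do not count towards the consistency level.
        "cl_counted_writes": normal,
        # The pending node takes no normal reads.
        "read_targets": normal,
    }
```

With RF = 3 and n3 being replaced by n4, this reproduces the example above: writes go to {n1, n2} plus pending {n4}, reads go to {n1, n2} only.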
Pavel Emelyanov
513ce1e6a5 storage_proxy_stats: Make get_ep_stat() noexcept
The .get_ep_stat(ep) call can throw when registering metrics (we have
issue for it, #5697). This is not expected by it callers, in particular
abstract_write_response_handler::timeout_cb breaks in the middle and
doesn't call the on_timeout() and the _proxy->remove_response_handler(),
which results in not removed and not released responce handler. In turn
not released response handler doesn't set the _ready future on which
response_wait() waits -> stuck.

Although the issue with .get_ep_stat() should be fixed, an exception in
it mustn't lead to deadlocks, so the fix is to make the get_ep_stat()
noexcept by catching the exception and returning a dummy stat object
instead to let caller(s) finish.

Fixes #5985
Tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200430163639.5242-1-xemul@scylladb.com>
2020-04-30 19:40:08 +03:00
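The "make it noexcept" pattern can be sketched as follows (a simplified model; the wrapper and dummy type are hypothetical):

```python
class DummyStats:
    """Placeholder stats object returned when registration throws."""

def get_ep_stat_noexcept(get_ep_stat, endpoint):
    # Swallow the metrics-registration failure so the timeout path can
    # still run on_timeout() and remove/release the response handler.
    try:
        return get_ep_stat(endpoint)
    except Exception:
        return DummyStats()
```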
Avi Kivity
88224619b6 Update seastar submodule
* seastar d0cbf7d1e8...0523b0fac4 (1):
  > Merge "Fix issues found by valgrind" from Rafael
2020-04-30 19:20:37 +03:00
Asias He
b8ac10c451 config: Do not enable repair based node operations by default
Give it some more time to mature. Use the old stream plan based node
operations by default.

Fixes: #6305
Backports: 4.0
2020-04-30 12:37:24 +03:00
Avi Kivity
8925e00e96 Merge 'Fix hang in multishard_writer' from Asias
"
This series fixes a hang in multishard_writer when an error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead

Fixes #6241
Refs: #6248
"

* asias-stream_fix_multishard_writer_hang:
  repair: Fix hang when the writer is dead
  mutation_writer_test: Add test_multishard_writer_producer_aborts
  multishard_writer: Abort the queue attached to consumers when producer fails
2020-04-30 12:27:55 +03:00
Avi Kivity
280854ab46 Merge " Avoid use-after-free of sstable writer" from Rafael
"
The backlog_controller has a timer that periodically accesses the
sstable writers of ongoing writes.

This patch series makes sure we remove entries from the list of ongoing
writes before the corresponding sstable writer is destroyed.

Fixes #6221.
"

* 'espindola/fix-6221-v5' of https://github.com/espindola/scylla:
  sstables: Call revert_charges in compaction_write_monitor::write_failed
  sstables: Call monitor->write_failed earlier.
  sstables: Add write_failed to the write_monitor interface
2020-04-30 12:21:27 +03:00
Pekka Enberg
5c6265d14b Merge 'redis: add setex and ttl commands' from Takuya
"Enable the TTL feature, and add setex and ttl commands to use it."

* 'redis_setex_ttl' of git://github.com/syuu1228/scylla:
  redis: add test for setex/ttl
  redis: add ttl command
  redis: add setex command
2020-04-30 09:39:48 +03:00
Pekka Enberg
d4c0d80f13 Merge 'redis: add lolwut test' from Takuya
"Add test for lolwut command, and also fix a bug on lolwut found by the test."

* 'redis_lolwut_test' of git://github.com/syuu1228/scylla:
  redis: lolwut parameter fix
  redis-test: add lolwut test
2020-04-30 09:30:43 +03:00
Piotr Sarna
c7c8bd0978 Update seastar submodule
* seastar 8fae03c2...d0cbf7d1 (6):
  > tests: restore compatibility with C++14 (broken due to std::filesystem)
  > http: make headers case-insensitive
  > on_internal_error: add scoped_no_abort_on_internal_error
  > Merge "make when_all functions noexcept" from Benny
  > chunked_fifo: fix underflow in reserve()
  > doc: document compatibility promises

Fixes #6319
2020-04-30 07:29:23 +02:00
Asias He
7d86a3b208 storage_service: Make replacing node take writes
Background:

Replace operation is used to replace a dead node in the cluster.
Currently, during a replace operation the replacing node does not take
any writes. As a result, new writes to a range after the sync for that
range is done, e.g., after streaming for that range is finished, will not
be synced to the replacing node. Hinted handoff or repair after the
replace operation will help, but it is better to make the writes reach
the replacing node and avoid any post-replace actions.

After this series and the repair-based node operation series, the replace
operation will guarantee that the replacing node has the latest copy of
the data, including the new writes made during the replace operation. In
short, no more repairs before or after the replace operation; just
replacing the node is enough.

Implementation:

1) Filter the node being replaced out of the natural endpoints in
   storage_proxy, so that:

- The node being replaced will not be selected as the target for
  normal writes or normal reads.

- We do not depend on gossip liveness to avoid selecting the replacing
  node for normal writes or reads when the replacing node has the same
  IP address as the node being replaced. There is no more special
  handling for the hibernate state in gossip, which makes it simpler and
  more robust. The replacing node will be marked as UP.

2) Put the replacing node in the pending list, so that:

- The replacing node will take writes, but writes to it will not be
  counted towards CL.

- The replacing node will not take normal reads.

Example:

For example, with RF = 3, n1, n2, n3 in the cluster, n3 is dead and
being replaced by node n4. When n4 starts:

- writes to nodes {n1, n2, n3} are changed to
  normal_replica_writes = {n1, n2} and pending_replica_writes= {n4}.

- reads to nodes {n1, n2, n3} are changed to
  normal_replica_reads = {n1, n2} only.

This way, the replacing node n4 now takes writes but does not take reads.

Tests:

1) Measure the number of writes during the pending period, i.e., between
   when the replacing node starts and finishes the replace operation.

- Start 5 nodes, n1 to n5.
- Stop n5
- Start write in the background
- Start n6 to replace n5
- Get scylla_database_total_writes metrics when the replacing node announces HIBERNATE (replacing) and NORMAL status.

Before:
2020-02-06 08:35:35.921837 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:35:35.939493 scylla_database_total_writes: node1={'scylla_database_total_writes': 15483}
2020-02-06 08:35:35.950614 scylla_database_total_writes: node2={'scylla_database_total_writes': 15857}
2020-02-06 08:35:35.961820 scylla_database_total_writes: node3={'scylla_database_total_writes': 16195}
2020-02-06 08:35:35.978427 scylla_database_total_writes: node4={'scylla_database_total_writes': 15764}
2020-02-06 08:35:35.992580 scylla_database_total_writes: node6={'scylla_database_total_writes': 331}
2020-02-06 08:36:49.794790 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:36:49.809189 scylla_database_total_writes: node1={'scylla_database_total_writes': 267088}
2020-02-06 08:36:49.823302 scylla_database_total_writes: node2={'scylla_database_total_writes': 272352}
2020-02-06 08:36:49.837228 scylla_database_total_writes: node3={'scylla_database_total_writes': 274004}
2020-02-06 08:36:49.851104 scylla_database_total_writes: node4={'scylla_database_total_writes': 262972}
2020-02-06 08:36:49.862504 scylla_database_total_writes: node6={'scylla_database_total_writes': 513}

Writes = 513 - 331

After:
2020-02-06 08:28:56.548047 Get metrics when other knows replacing node = HIBERNATE
2020-02-06 08:28:56.560813 scylla_database_total_writes: node1={'scylla_database_total_writes': 290886}
2020-02-06 08:28:56.573925 scylla_database_total_writes: node2={'scylla_database_total_writes': 310304}
2020-02-06 08:28:56.586305 scylla_database_total_writes: node3={'scylla_database_total_writes': 304049}
2020-02-06 08:28:56.601464 scylla_database_total_writes: node4={'scylla_database_total_writes': 303770}
2020-02-06 08:28:56.615066 scylla_database_total_writes: node6={'scylla_database_total_writes': 604}
2020-02-06 08:29:10.537016 Get metrics when other knows replacing node = NORMAL
2020-02-06 08:29:10.553257 scylla_database_total_writes: node1={'scylla_database_total_writes': 336126}
2020-02-06 08:29:10.567181 scylla_database_total_writes: node2={'scylla_database_total_writes': 358549}
2020-02-06 08:29:10.581939 scylla_database_total_writes: node3={'scylla_database_total_writes': 351416}
2020-02-06 08:29:10.595567 scylla_database_total_writes: node4={'scylla_database_total_writes': 350580}
2020-02-06 08:29:10.610548 scylla_database_total_writes: node6={'scylla_database_total_writes': 45460}

Writes = 45460 - 604

As we can see, the replacing node did not take writes before the patch and takes writes after it.

2) Check the log of the write handler in storage_proxy

storage_proxy - creating write handler for token: -2642068240672386521,
keyspace_name=ks, original_natrual={127.0.0.1, 127.0.0.5, 127.0.0.2},
natural={127.0.0.1, 127.0.0.2}, pending={127.0.0.6}

The node being replaced, n5=127.0.0.5, is filtered out and the replacing
node, n6=127.0.0.6 is in the pending list.

Fixes: #5482
2020-04-30 10:22:30 +08:00
Asias He
e3fbc8fba1 repair: Use token_metadata with the replacing node in do_rebuild_replace_with_repair
We will change the update of tokens in token_metadata in the next patch
so that the tokens of the replacing node are updated in token_metadata
only after the replace operation is done. In order to get the correct
ranges for the replacing node in do_rebuild_replace_with_repair, we need
to use a copy of token_metadata that contains the tokens of the replacing
node.

Refs: #5482
2020-04-30 10:22:30 +08:00
Asias He
b640614aa6 abstract_replication_strategy: Add get_ranges which takes token_metadata
It is useful when the caller wants to calculate ranges using a
custom token_metadata.

It will be used soon in do_rebuild_replace_with_repair for replace
operation.

Refs: #5482
2020-04-30 10:22:30 +08:00
Asias He
37d3d3e051 abstract_replication_strategy: Add get_natural_endpoints_without_node_being_replaced
Similar to natural_endpoints but with the node being replaced filtered out.

Refs: #5482
2020-04-30 10:22:30 +08:00
Asias He
1a75a60cfc abstract_replication_strategy: Add allow_remove_node_being_replaced_from_natural_endpoints
Decide whether the replication strategy allows removing the node being
replaced from the natural endpoints when a node is being replaced in the
cluster. LocalStrategy is not allowed to do so because it always returns
the node itself as the natural endpoints, and the node will not appear
in the pending endpoints.

It is needed by the "Make replacing node take writes" work.

Refs: #5482
2020-04-30 10:22:30 +08:00
Pekka Enberg
eac9e253e7 sstables: Fix open-coded version parsing in make_descriptor()
The make_descriptor() function parses a string representation of sstable
version using a ternary operator. Clean it up by using
sstables::from_string(), which is future-proof when we add support for
later sstable formats.
Message-Id: <20200429082126.15944-1-penberg@scylladb.com>
2020-04-29 16:25:12 +02:00
Asias He
bd6691301e token_metadata: Calculate pending ranges for replacing node
It will be needed soon for making the replacing node take writes.

Refs: #5482
2020-04-29 16:02:10 +08:00
Asias He
75cf1d18b5 storage_service: Unify handling of replaced node removal from gossip
Currently, after the replacing node finishes the replace operation, it
removes the node being replaced from gossip directly in
storage_service::join_token_ring() with gossiper::replaced_endpoint(),
so the gossip states for the replaced node are gone.

When the other nodes know the replace operation is done, they call
storage_service::remove_endpoint() and gossiper::remove_endpoint() to
quarantine the node but keep the gossip states. To prevent the
replacing node from learning the state of the replaced node again from
existing nodes, the replacing node uses 2X quarantine time.

This makes the gossip states for the replaced node differ between the
other nodes and the replacing node, and makes it harder to reason about
the gossip states because of the discrepancy between nodes.

To fix, we unify the handling of replaced node on both replacing node
and other nodes. On all the nodes, once the replacing node becomes
NORMAL status, we remove the replaced node from token_metadata and
quarantine it but keep the gossip state. Since the replaced node is no
longer a member of the cluster, the fatclient timer will count and
expire and remove the replaced node from gossip.

Refs: #5482
2020-04-29 16:02:10 +08:00
Asias He
66c1907524 storage_service: Update tokens and replace address for replace operation
The motivation is to make the replacing node have the same view of the
token ring as the rest of the cluster.

If the replacing node has the same IP address as the node being replaced,
we should update the tokens in token_metadata when the replace operation
starts, so that the replacing node and the rest of the cluster see the
same token ring.

If the replacing node has a different IP address from the node being
replaced, we should update the tokens in token_metadata only when the
replace operation is done, because the other nodes will update the
replacing node's tokens in token_metadata when the replace operation is
done.

Refs: #5482
2020-04-29 16:02:00 +08:00
Nadav Har'El
ff5615d59d alternator test: drastically reduce time to boot Scylla
The alternator test, test/alternator/run, runs Scylla and runs the
various tests against it. Before this patch, just booting Scylla took
about 26 seconds (for a dev build, on my laptop). This patch reduces
this delay to less than one second!

It turns out that almost the entire delay was artificial, two periods
of 12 seconds "waiting for the gossip to settle", which are completely
unnecessary in the one-node cluster used in the Alternator test.
So a simple "--skip-wait-for-gossip-to-settle 0" parameter eliminates
these long delays completely.

Amusingly, the Scylla boot is now so fast, that I had to change a "sleep 2"
in the test script to "sleep 1", because 2 seconds is now much more than
it takes to boot Scylla :-)

Fixes #6310.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200428145035.22894-1-nyh@scylladb.com>
2020-04-29 07:55:03 +02:00
Benny Halevy
3b31acfa80 exceptions: drop OVERFLOW_ERROR cql binary protocol extension
Client drivers act differently on error codes they don't recognize.
Adding new error codes is considered a protocol extension and
should be negotiated with the client.

This change keeps `overflow_error_exception` internally but uses
the INVALID cql error code to return the error message to the client,
similar to keyspace_not_defined_exception.

We (and cassandra) already use `invalid_request_exception` extensively
to return various errors related to invalid values or types in the query.

Fixes #6264

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200422130011.108003-1-bhalevy@scylladb.com>
2020-04-28 12:16:00 +03:00
Piotr Sarna
09e4f3b917 alternator: implement ScanIndexForward
The ScanIndexForward parameter is now fully implemented
and can accept ScanIndexForward=false in order to query
the partitions in reverse clustering order.
Note that reading partition slices in reverse order is less
efficient than forward scans and may put a strain on memory
usage, especially for large partitions, since the whole partition
is currently fetched in order to be reversed.
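
The cost described above can be sketched like this (a toy model with hypothetical names, not the actual alternator code): the slice is read in forward clustering order, and only after the whole slice is materialized can it be flipped for ScanIndexForward=false.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the approach: rows arrive in forward
// clustering order; when the request has ScanIndexForward=false the
// fetched rows are reversed in memory. This is why reverse scans cost
// more memory on large partitions: the whole slice must be held
// before it can be reversed.
inline std::vector<int> scan_partition(std::vector<int> forward_rows,
                                       bool scan_index_forward) {
    if (!scan_index_forward) {
        std::reverse(forward_rows.begin(), forward_rows.end());
    }
    return forward_rows;
}
```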

Fixes #5153
2020-04-28 11:44:46 +03:00
Piotr Sarna
be5d3f4733 Merge 'A bunch of refactors in versioned_value and gossiper' from Kamil
1. Remove the `versioned_value::factory` class; it didn't add any value. It just
   forced us to create an object for making `versioned_value`s, for no sensible
   reason.
2. Move some `versioned_value` deserialization code (string -> internal data
   structures) into the versioned_value module. Previously, it was scattered
   all over the place.
3. Make `gossiper::get_seeds` const and return a const reference.

I needed these refactors for a PR I was preparing to fix an issue with CDC. The
attempt to fix the issue failed (I'm trying something different now), but the
refactors might be useful anyway.

* kbr--vv-refactor:
  gossiper: make `get_seeds` method const and return a const ref
  versioned_value: remove versioned_value::factory class
  gms: move TOKENS string deserialization code into versioned_value
2020-04-28 10:27:45 +02:00
Pavel Solodovnikov
ed7a7554b8 storage_proxy: allow cas() to accept nullptr read_command
This patch allows users of storage_proxy::cas() to supply nullptr
as the `query::read_command`, which skips the procedure of reading
the existing value.

The feature is used in alternator code for Read-Modify-Write
operations: some of them don't require reading previous item
values before updating.

Move `read_nothing_read_command` from alternator code to
storage_proxy layer and fabricate a new no-op command from it when
storage_proxy::cas() is used with nullptr read_command.

This allows us to avoid sprinkling if-else branches all over the code
just to check whether `cmd` is null.

We return from storage_proxy::query() very early with an empty
result when we're given an empty partition_slice (which resides
inside the passed `read_command`), so this approach should be
perfectly fine.
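
The shape of this change can be sketched as follows (hypothetical, heavily simplified types for illustration; the real `query::read_command` is far richer): cas() fabricates a no-op "read nothing" command when the caller passes nullptr, so downstream code never has to special-case a null pointer.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical sketch of centralizing the nullptr handling: instead of
// if-else branches at every use site, a single substitution point
// fabricates a no-op "read nothing" command (an empty partition slice)
// when the caller passes nullptr. The query layer then returns an
// empty result for the empty slice, as described above.
struct read_command {
    std::vector<int> requested_columns;  // empty == read nothing
};

inline std::shared_ptr<read_command> make_read_nothing_command() {
    return std::make_shared<read_command>();  // empty slice
}

inline std::shared_ptr<read_command>
effective_command(std::shared_ptr<read_command> cmd) {
    // callers that don't need the previous value pass nullptr
    return cmd ? cmd : make_read_nothing_command();
}
```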

Expand the documentation for the `cas()` function to cover the new
possible value of the `cmd` argument.

Fixes: #6238
Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200428065235.5714-1-pa.solodovnikov@scylladb.com>
2020-04-28 10:44:19 +03:00
Asias He
35c5ef78b9 repair: Fix hang when the writer is dead
Consider the following scenario:

When repair master gets data from repair follower:

1) apply_rows_on_master_in_thread is called
2) a repair writer is created with _repair_writer.create_writer
3) the repair writer fails
4) data is written via _mq[node_idx]->push_eventually to the queue
   attached to the writer

Since the writer is dead, no one is going to fetch data from the _mq
queue, and apply_rows_on_master_in_thread will block forever.

To fix this, we should abort the _mq queue when the writer fails.
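
A single-threaded toy model of this fix (hypothetical class, not the actual seastar queue used by the repair code): once the consumer is gone, the queue is aborted, so a producer's push fails fast instead of blocking forever on a reader that will never come.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdexcept>

// Hypothetical model: when the writer (consumer) fails, abort() is
// called on the queue; subsequent pushes throw instead of waiting
// forever for a consumer that no longer exists.
template <typename T>
class abortable_queue {
    std::deque<T> _items;
    bool _aborted = false;
public:
    void abort() { _aborted = true; }  // called when the writer fails
    void push(T item) {
        if (_aborted) {
            throw std::runtime_error("queue aborted: writer is gone");
        }
        _items.push_back(std::move(item));
    }
    std::size_t size() const { return _items.size(); }
};
```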

Refs: #6248
2020-04-28 12:14:32 +08:00
Raphael S. Carvalho
02e046608f api/service: fix segfault when taking a snapshot without keyspace specified
If no keyspace is specified when taking a snapshot, there will be a
segfault because keynames is unconditionally dereferenced. Let's return
an error instead, since a keyspace must be specified when column
families are specified.
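
The guard can be sketched like this (hypothetical signature for illustration, not the actual api/service handler): validate the combination of arguments before anything is dereferenced.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch of the fix: when column families are named but
// no keyspace list was supplied, return an error up front instead of
// unconditionally dereferencing the missing keyspace argument.
inline std::string take_snapshot(
        const std::optional<std::vector<std::string>>& keyspaces,
        const std::vector<std::string>& column_families) {
    if (!column_families.empty() && !keyspaces) {
        return "error: a keyspace must be specified "
               "when column families are specified";
    }
    return "ok";
}
```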

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
2020-04-27 23:37:00 +03:00
Pekka Enberg
3a10bddd7d configure.py: Add '--with-seastar' option
This patch adds a '--with-seastar=<PATH>' option to configure.py, which
allows the user to override the default seastar submodule path. This is
useful when building packages from source tarballs, for example.

Message-Id: <20200427165511.6448-1-penberg@scylladb.com>
2020-04-27 20:01:35 +03:00