Commit Graph

24908 Commits

Author SHA1 Message Date
Nadav Har'El
cb9e2ee00a cql-pytest: tests for fromJson() setting a map<ascii, int>
The fromJson() function can take a map JSON and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, scylla fails the
fromJson() function, with a misleading error.

Refs #7949.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
2021-01-22 14:29:25 +01:00
Pavel Emelyanov
90d445464b compaction: Remove compaction_manager::enabled()
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used
and can even be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
2021-01-22 14:07:38 +02:00
Kamil Braun
570d15c7bc multishard_combining_reader: do not use smp::count
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader causes a segfault.

Fixes #7945.
2021-01-21 18:28:18 +02:00
Nadav Har'El
328be1ca7c cql-pytest: tests for fromJson() not accepting empty string as integer
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.

The tests in this patch check this for two integer types: int and
varint.

Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.

The tests demonstrating Scylla's bug are marked xfail, and the tests
demonstrating Cassandra's bug is marked "cassandra_bug" (which means
it is marked xfail only when running against Cassandra, but expected
to succeed on Scylla.

Refs #7944.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
2021-01-21 15:24:48 +01:00
Nadav Har'El
702b1b97bf cql: fix error return from execution of fromJson() and other functions
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.

This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:

1. Parse errors in fromJson()'s parameters are converted into a
   function_execution_exception.

2. Any exceptions during the execute() of a native_scalar_function_for
   function is converted into a function_execution_exception.
   In particular, fromJson() uses a native_scalar_function_for.

   Note, however, that functions which already took care to produce
   a specific Cassandra error, this error is passed through and not
   converted to a function_execution_exception. An example is
   the blobAsText() which can return an invalid_request error, so
   it is left as such and not converted. This also happens in Cassandra.

All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.

Fixes #7911

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
2021-01-21 15:21:13 +01:00
Nadav Har'El
49440d67ad Merge: Fix multiple issues with timeuuid type
Merged patch series by Konstantin Osipov:

"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.

The test coverage is extended for all of the above."

  uuid: add a comment warning against UUID::operator<
  uuid: replace slow versions of timeuiid compare with optimized/tested versions.
  test: add tests for legacy uuid compare & msb monotonicity
  test: add a test case for append/prepend limit
  test: add a test case for monotonicity of timeuuid least significant bits
  uuid: implement optimized timeuuid compare
  test: add a test case for list prepend/append with custom timestamp
  lists: rewrite list prepend to use append machinery
  lists: use query timestamp for list cell values during append
  uuid: fill in UUID node identifier part of UUID
  test: add a CQL test for list append/prepend operations
2021-01-21 13:20:07 +02:00
Konstantin Osipov
e18e2cb9f2 uuid: add a comment warning against UUID::operator< 2021-01-21 13:03:59 +03:00
Konstantin Osipov
845f6c667b uuid: replace slow versions of timeuiid compare with optimized/tested versions. 2021-01-21 13:03:59 +03:00
Konstantin Osipov
56d8d166cb test: add tests for legacy uuid compare & msb monotonicity 2021-01-21 13:03:59 +03:00
Konstantin Osipov
257c5b0879 test: add a test case for append/prepend limit 2021-01-21 13:03:59 +03:00
Konstantin Osipov
d6e65a3735 test: add a test case for monotonicity of timeuuid least significant bits
Ensure that timeuuid least significant bits are compared correctly.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
0af3758aff uuid: implement optimized timeuuid compare
Introduce uint64_t based comparator for serialized timeuuids.

Respect Cassandra legacy for timeuuid compare order.

Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.

This commit re-implements the implementations of these comparators in
types.cc and deprecates the respective implementations types.cc. They
will be removed in a following patch.

A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
b4500a55c7 test: add a test case for list prepend/append with custom timestamp
Scylla now takes a custom timestamp into account when
executing list append/prepend operations. Test the new
semantics.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
232ce6f611 lists: rewrite list prepend to use append machinery
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.

After this patch, list prepend begins to honor user supplied timestamps.

If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.

Fixes #7611
2021-01-21 13:03:59 +03:00
Konstantin Osipov
2b8ce83eea lists: use query timestamp for list cell values during append
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.

Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.

This is reported as https://github.com/scylladb/scylla/issues/7611

A statement which appended multiple values to a list or a BATCH
generated an own microsecond-resolution timeuuid for each value:

BEGIN BATCH
  UPDATE ... SET a = a + [3]
  UPDATE ... SET a = a + [4]
APPLY BATCH

UPDATE ... SET a = a + [3, 4]

To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.

To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:

60 bits: time since start of GMT epoch, year 1582, represented
         in 100-nanosecond units
4 bits:  version
14 bits: clock sequence, a random number to avoid duplicates
         in case system clock is adjusted
2 bits:  type
48 bits: MAC address (or other hardware address)

The purpose of clockseq bits is as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5
is to reduce the probability of UUID collision in case clock
goes back in time or node id changes. The implementation should reset it
whenever one of these events may occur.

Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.

The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
  values of a statement or a batch. The time is unique and monotonic in
  case of LWT. Otherwise it's most always monotonic, but may not be
  unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
  unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
  if time is auto-generated, MAC component contains a random (spoof)
  MAC address, re-created on each restart. The address is different
  at each shard.

The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code while not increase
quality (uniqueness) of created list keys.

Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If the rang is exceeded beyond the limit, an exception is produced.

A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.

Fixes #7611

test: unit (dev)
2021-01-21 13:03:59 +03:00
Konstantin Osipov
6d1781be36 uuid: fill in UUID node identifier part of UUID
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.

To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.

The shard id is mixed into higher bits of MAC, to reduce the
chances on NIC collision within the same network.

With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.

E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l  = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.

At least we are now less likely to get any value lost.

Fixes #6208.

@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()

test: unit (dev)
2021-01-21 13:03:53 +03:00
Avi Kivity
4cfaab208e allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.

Note that managed_bytes is the only user.

Closes #7943
2021-01-21 11:15:13 +02:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pekka Enberg
7d98e05923 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-21 10:41:33 +02:00
Avi Kivity
daa0e964fc dbuild: avoid --pids-limit with podman and cgroupsv1
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.

To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.

Fixes #7938.

Closes #7939
2021-01-21 10:41:33 +02:00
Botond Dénes
4d581f1bb3 docs/README.md: guides: also mention running and debugging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>
2021-01-20 16:07:29 +02:00
Avi Kivity
f11a0700a8 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream
2021-01-20 16:07:29 +02:00
Pekka Enberg
6cc981d089 scylla: Add "--build-mode" command line option
This adds a "--build-mode" command line option to "scylla" executable:

$ ./build/dev/scylla --build-mode
dev

This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.

Closes #7865
2021-01-20 16:07:29 +02:00
Botond Dénes
7eb8c71342 tools/scylla-types: add link to cql3-type-mapping.md
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the clq3 type names to
the class names is not straight-forward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Botond Dénes
882ade7c6a types/scylla-sstable-index: update URL to cql3-type-mapping.md
Said document was recently moved but the URL was not updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Avi Kivity
114da51d73 Revert "commitlog: fix size of a write used to zero a segment"
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.

A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.

Reopens #5857.
2021-01-20 10:23:43 +02:00
Avi Kivity
586f16bf79 Merge "Cut snitch -> storage service dependency" from Pavel E
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.

This interdependency is relaxed:

- snitch gossips all its state itself without using the
  storage service as a mediator
- storage service listens for snitch updates with the help
  of self-breaking subscription

Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dagling pointers/references

tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"

* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
  storage-service: Subscribe to snitch to update topology
  snitch: Introduce reconfiguration signal
  snitch: Always gossip snitch info itself
  snitch: Do gossip DC and RACK itself
  snitch: Add generic gossiping helper
2021-01-20 10:23:43 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
248449816b raft: fix snapshot transfer with existing log prefix
Current code that checks when snapshot has to be transferred does not
take in account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for snapshot transfer
condition.

Message-Id: <20210117095801.GB733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
1ab262e86b raft: test: change replication_test to submit one entry at a time
replication_test's state machine is not commutative, so if commands are
applied in different order the states will be different as well. Since
the preemption check was added into co_await in seastar even waiting for
a ready future can preempt which will cause reordering of simultaneously
submitted entries in debug mode. For a long time we tried to keep entries
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.

Message-Id: <20210117095147.GA733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Benny Halevy
f29732573a mutation_writer: bucket_writer: add close
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().

_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way in the stack.  This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.

Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 19:03:58 +02:00
Benny Halevy
fc3f9a57ff mutation_writer/feed_writers: refactor bucket/shard writers
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.

This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_conmsumer_fut.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:48:01 +02:00
Benny Halevy
a9d91a2d09 mutation_writer: update bucket/shard writers consume_end_of_stream
After 61520a33d6
feed_writers doesn't call consume_end_of_stream
after abort() so no need to test
            if (!_handle.is_terminated()) {

and consume_end_of_stream is now called in then_wrapped
rather than `finally` so it's ok if it throws.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:44:40 +02:00
Kamil Braun
1a8630e6a7 transport: silence "broken pipe" and "connection reset by peer" errors
The code would already silence broken pipe exceptions since it's
expected when the other side closes the connection or when we shutdown the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
   aforementioned two scenarios; the conditions that determine which of
   the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
   mainly during shutdown. The errors could happen once when trying to send
   a response to a request (`_write_buf.write(...)/flush(...)`) and then
   again when trying to close the connection in a `finally` block. These
   nested exceptions were not silenced.

The commit handles each of these cases.
Closes #7907.

Closes #7931
2021-01-19 10:30:17 +02:00
Tomasz Grabiec
94749b01eb Merge "futurize flat_mutation_reader::next_partition" from Benny
The main motivation for this patchset is to prepare
for adding a async close() method to flat_mutation_reader.

In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destoring it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.

Test: unit(release, debug)

* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
  flat_mutation_reader: return future from next_partition
  multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
  mutation_reader: filtering_reader: fill_buffer: futurize inner loop
  flat_mutation_reader::impl: consumer_adapter: futurize handle_result
  flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
  flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
  flat_mutation_reader:impl: get rid of _consume_done member
2021-01-19 10:19:03 +02:00
Alejo Sanchez
8a61e7defc raft: etcd unit tests: test log replication
etcd TestLogReplication

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
417b18aaad raft: boost test etcd: test fsm can vote from any state
etcd TestVoteFromAnyState

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
5a75c0e06a raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
Log truncation of follower when node re-gains leadership.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f14c44c686 raft: replication test: add etcd test for cycling leaders
This test cycles 3 nodes as leaders without adding entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.

This prevents occasional hangs in debug mode for incoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
948ae813e4 raft: etcd unit tests: initial boost tests
First batch of ported etcd raft unit tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:12 -04:00
Gleb Natapov
6d47a535b9 raft: combine append_request _receive and _send
Combine structs for append request send and receive into a single
struct.

Author:    Gleb Natapov <gleb@scylladb.com>
Date:      Mon Nov 23 14:33:14 2020 +0200
2021-01-18 12:24:13 -04:00
Konstantin Osipov
bf1a031bd6 test: add a CQL test for list append/prepend operations
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.

This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
2021-01-18 17:32:00 +03:00
Jenkins
faf71c6f75 release: prepare for 4.5.dev 2021-01-18 16:05:25 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / 40 MB per shard * 32 shard / 60 s = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Avi Kivity
60f5ec3644 Merge 'managed_bytes: switch to explicit linearization' from Michał Chojnowski
This is a revival of #7490.

Quoting #7490:

The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.

We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.

As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().

Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).

By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.

This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.

Closes #7820

* github.com:scylladb/scylla:
  test: add hashers_test
  memtable: fix accounting of managed_bytes in partition_snapshot_accounter
  test: add managed_bytes_test
  utils: fragment_range: add a fragment iterator for FragmentedView
  keys: update comments after changes and remove an unused method
  mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
  row_cache: more indentation fixes
  utils: remove unused linearization facilities in `managed_bytes` class
  misc: fix indentation
  treewide: remove remaining `with_linearized_managed_bytes` uses
  memtable, row_cache: remove `with_linearized_managed_bytes` uses
  utils: managed_bytes: remove linearizing accessors
  keys, compound: switch from bytes_view to managed_bytes_view
  sstables: writer: add write_* helpers for managed_bytes_view
  compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
  types: change equal() to accept managed_bytes_view
  types: add parallel interfaces for managed_bytes_view
  types: add to_managed_bytes(const sstring&)
  serializer_impl: handle managed_bytes without linearizing
  utils: managed_bytes: add managed_bytes_view::operator[]
  utils: managed_bytes: introduce managed_bytes_view
  utils: fragment_range: add serialization helpers for FragmentedMutableView
  bytes: implement std::hash using appending_hash
  utils: mutable_view: add substr()
  utils: fragment_range: add compare_unsigned
  utils: managed_bytes: make the constructors from bytes and bytes_view explicit
  utils: managed_bytes: introduce with_linearized()
  utils: managed_bytes: constrain with_linearized_managed_bytes()
  utils: managed_bytes: avoid internal uses of managed_bytes::data()
  utils: managed_bytes: extract do_linearize_pure()
  thrift: do not depend on implicit conversion of keys to bytes_view
  clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
  cql3: expression: linearize get_value_from_mutation() eariler
  bytes: add to_bytes(bytes)
  cql3: expression: mark do_get_value() as static
2021-01-18 11:01:28 +02:00
Asias He
4d32d03172 storage_service: Introduce load_and_stream
=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / 40 MB per shard * 32 shard / 60 s = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831
2021-01-18 16:32:33 +08:00
Avi Kivity
ab44464911 Revert "docker: remove sshd from the image"
This reverts commit 32fd38f349. Some
tests (in scylla-cluster-tests) depend on it.
2021-01-17 14:34:40 +02:00
Raphael S. Carvalho
00c29e1e24 table: Move notify_bootstrap_or_replace_*() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>
2021-01-17 10:36:13 +02:00
Asias He
28007f13f8 distributed_loader: Add get_sstables_from_upload_dir
This function scans sstables under the upload directory and return a list of
sstables for each shard.

Refs #7831
2021-01-16 20:03:17 +08:00