Commit Graph

267 Commits

Author SHA1 Message Date
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Benny Halevy
f081e651b3 memtable_list: rename request_flush to just flush
Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
948a9da832 table: do_apply: verify that _async_gate is open
Applying changes to the memtable after table::stop
is prohibited. Verify that by making sure that
the _async_gate is still open in `do_apply`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210601055042.41380-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Calle Wilund
131da30856 table: Always use explicit commitlog discard + clear out rp_set
Fixes #8733

If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.

Closes #8766
2021-06-06 09:21:23 +03:00
Benny Halevy
1c0769d789 table: clear: make exception safe
It is currently possible that _memtables->add_memtable()
will throw after _memtables->clear(), leaving the memtables
list completely empty. However, we do rely on always
having at least one allocated in the memtables list
as active_memtable() references a lw_shared_ptr<memtable>
at the back of the memtables vector, and it is expected
to always be allocated via add_memtable() upon construction
and after clear().

This change moves the implementation of this convention
to memtable_list::clear() and makes the latter exception safe
by first allocating the to-be-added empty memtable and
only then clearing the vector.
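
The ordering described above (allocate first, clear afterwards) can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `memtable_sketch` and `memtable_list_sketch` are hypothetical stand-ins, std::shared_ptr replaces lw_shared_ptr, and a swap is used in place of an in-place clear.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a memtable; not Scylla's real class.
struct memtable_sketch {};

// Invariant: the list always holds at least one memtable,
// and the active one is at the back.
class memtable_list_sketch {
    std::vector<std::shared_ptr<memtable_sketch>> _memtables;
public:
    memtable_list_sketch() {
        _memtables.push_back(std::make_shared<memtable_sketch>());
    }

    void clear() {
        // Everything that can throw happens before _memtables is touched.
        auto fresh = std::make_shared<memtable_sketch>();
        std::vector<std::shared_ptr<memtable_sketch>> replacement;
        replacement.reserve(1);
        // No reallocation after reserve(), and moving a shared_ptr is noexcept.
        replacement.push_back(std::move(fresh));
        // swap() does not throw; the old memtables die with `replacement`.
        _memtables.swap(replacement);
    }

    memtable_sketch& active_memtable() {
        assert(!_memtables.empty());
        return *_memtables.back();
    }

    std::size_t size() const { return _memtables.size(); }
};
```

If the allocation fails, the exception leaves clear() before the vector is modified, so active_memtable() still has something to return.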

Refs #8749

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>
2021-05-30 13:22:52 +03:00
Asias He
72cc596842 repair: Wire off-strategy compaction for regular repair
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.

This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.
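
A timeout-based trigger of this kind might look roughly like the sketch below. It is an assumption-laden illustration: `offstrategy_trigger_sketch`, its method names, and the polling model are all hypothetical, and the real code uses Seastar timers rather than explicit time points.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: decides when a table has been idle long enough
// after repair activity to start off-strategy compaction.
class offstrategy_trigger_sketch {
public:
    using clock_type = std::chrono::steady_clock;

    explicit offstrategy_trigger_sketch(clock_type::duration idle_timeout)
        : _idle_timeout(idle_timeout) {
        assert(idle_timeout > clock_type::duration::zero());
    }

    // Called whenever a repair request touches the table; restarts the countdown.
    void on_repair_activity(clock_type::time_point now) {
        _last_activity = now;
        _pending = true;
    }

    // Polled periodically. Returns true once per idle period.
    bool should_trigger(clock_type::time_point now) {
        if (_pending && now - _last_activity >= _idle_timeout) {
            _pending = false;
            return true;
        }
        return false;
    }

private:
    clock_type::duration _idle_timeout;
    clock_type::time_point _last_activity{};
    bool _pending = false;
};
```

Each new repair request pushes the deadline forward, so a burst of requests covering a few token ranges each results in a single trigger after the last one.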

Fixes #8677

Closes #8678
2021-05-26 11:41:27 +03:00
Benny Halevy
6144656b25 table: seal_active_memtable: update stats also on the error path
Currently the pending (memtables) flushes stats are adjusted back
only on success, therefore they will "leak" on error, so
use a .then_wrapped clause to always update the stats.

Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
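
The same "always adjust the stats back" idea can be shown without Seastar. The sketch below swaps the .then_wrapped continuation for a plain RAII guard, so it illustrates the principle rather than the patch itself; `flush_stats_sketch`, `pending_flush_guard` and `seal_sketch` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stats holder; the real table keeps more counters.
struct flush_stats_sketch {
    int pending_flushes = 0;
};

// Increments the counter on construction and decrements it on destruction,
// whether the scope is left normally or by an exception.
class pending_flush_guard {
    flush_stats_sketch& _stats;
public:
    explicit pending_flush_guard(flush_stats_sketch& stats) : _stats(stats) {
        ++_stats.pending_flushes;
    }
    ~pending_flush_guard() {
        assert(_stats.pending_flushes > 0);
        --_stats.pending_flushes;
    }
    pending_flush_guard(const pending_flush_guard&) = delete;
    pending_flush_guard& operator=(const pending_flush_guard&) = delete;
};

// Runs the flush; any exception from do_flush propagates to the caller
// after the guard has restored the counter.
inline void seal_sketch(flush_stats_sketch& stats, const std::function<void()>& do_flush) {
    pending_flush_guard guard(stats);
    do_flush();
}
```

Work that must happen only on success would go after do_flush() inside seal_sketch(), where an exception skips it but the guard still runs.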

Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
2021-05-25 12:51:54 +02:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneeded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Avi Kivity
1d508106be table: drop unused field database_sstable_write_monitor::_compaction_manager 2021-05-21 21:04:20 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Nadav Har'El
58e275e362 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
2021-05-05 13:23:00 +03:00
Avi Kivity
3e6232bb92 Merge "Wire offstrategy compaction to repair-based removenode" from Raphael
"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reduce both latency and operation time.

Refs #5226.
"

* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
  repair: Wire offstrategy compaction to repair-based removenode
  table: introduce trigger_offstrategy_compaction()
  repair/row_level: make operations_supported static const
2021-04-28 12:02:07 +03:00
Benny Halevy
825acd4031 table: for_all_partitions_slow: close iteration_step reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
320cb67b08 table: query, mutation_query: close querier when done
Make sure to close the querier and subsequently its reader before
destroying it (unless it was moved).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
efe938cf1f flat_mutation_reader: make sure to close reader passed to read_mutation_from_flat_mutation_reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Avi Kivity
daeddda7cc treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).

Ref #1.
2021-04-20 21:23:00 +03:00
Raphael S. Carvalho
84f7ae2c82 table: remove unneeded code as sstables are not shared anymore
given that resharding is now a synchronous mandatory step, before
table is populated, snapshot() can now get rid of code which takes
into account whether or not a sstable is shared.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>
2021-04-15 11:59:41 +02:00
Raphael S. Carvalho
5c630f405a table: introduce trigger_offstrategy_compaction()
this function will be used on repair-based operation completion,
to notify table about the need to start offstrategy compaction
process on the maintenance sstables produced by the operation.
Function which notifies about bootstrap and replace completion
is changed to use this new function.
Removenode and decommission will reuse this function.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-09 14:53:14 -03:00
Botond Dénes
5c8f142fe5 table: add mutation_query()
We want to migrate `database::mutation_query()` off `mutation_query()`
to use `table::mutation_query()` instead. The reason is the same as for
making `table::query()` standalone: the `mutation_query()`
implementation increasingly became specific to how tables are queried
and is about to become even more specific due to impending changes to
how permits are obtained. As no-one in the codebase is doing generic
mutation queries on generic mutation sources we can just make this a
member of table.
This patch just adds `table::mutation_query()`, no user exists yet.
`table::mutation_query()` is identical to `mutation_query()`, except
that it is a coroutine.
2021-04-09 13:40:27 +03:00
Botond Dénes
c3f0681011 table: query(): inline data_query() code into query()
`data_query()` is now just a thin wrapper over
`data_querier::consume_page()`. Furthermore, contrary to the old data
query method, it is not a generic way of querying a mutation source, it
is now closely tied to how we query tables. It does a querier lookup and
save. In the future we plan on tying it even closer to the table in how
permits are obtained. For this reason it is better to just inline it
into the `query()` method which invokes it.
2021-04-09 13:40:27 +03:00
Botond Dénes
b03f360bb0 table: make query() a coroutine
This method is very hard to read or modify in its current form due to
all the continuation-chain boilerplate. Make it a coroutine to
facilitate future changes, both in the next patches and beyond.
2021-04-09 11:04:35 +03:00
Botond Dénes
32ae51dc2c table: query(): fix typo (short_read_allwoed)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210408133018.65692-1-bdenes@scylladb.com>
2021-04-08 16:34:08 +03:00
Raphael S. Carvalho
65b09567dd table: Wire up off-strategy compaction on repair-based bootstrap and replace
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.

We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence isn't accounted for by the compaction backlog tracker yet.

Refs #5226.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c45d2e1d27 table: extend add_sstable_and_update_cache() for off-strategy
Function is extended to add sstable to maintenance set if requested
by the caller.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
e0e5bf8285 table: Introduce off-strategy compaction on maintenance sstable set
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
effects on read amplification are avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6e95860e09 table: change non_staging_sstables() to filter out off-strategy sstables
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c64a156c53 table: Introduce maintenance sstable set
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.

Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:47 -03:00
Raphael S. Carvalho
1e7a444a8b table: Wire compound sstable set
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, maintenance
set will be introduced and will also be managed by the compound set.

So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.

By storing compound set in _sstables, functions which used _sstables for
creating reader, computing statistics, etc, will not have to be changed
when we introduce the maintenance set, so this approach keeps the code
change to a minimum.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:46:06 -03:00
Raphael S. Carvalho
42b309b43e table: prepare make_reader_excluding_sstables() to work with compound sstable set
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
4e142458eb table: prepare discard_sstables() to work with compound sstable set
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
d25822a030 table: extract add_sstable() common code into a function
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
f6fc32c8da table: use new sstable_set::for_each_sstable
for_each_sstable() is preferred over all() because it's guaranteed to
perform no copy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-2-raphaelsc@scylladb.com>
2021-03-11 18:47:17 +02:00
Raphael S. Carvalho
05b07c7161 sstable_set: preparatory work to change sstable_set::all() api
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.

this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.

so even range-based loops on all() have to keep a ref to the returned
list, to prevent the list from being prematurely destroyed.

so the following code
	for (auto& sst : *sstable_set.all()) { ...}
becomes
	for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
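
A self-contained illustration of why the reference has to be kept, under stated assumptions: `compound_set_sketch` is hypothetical, ints stand in for sstables, and a named local is used instead of the C++20 init-statement so that the sketch also compiles with older standards.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical compound set: all() merges two managed sets into a
// temporary list that nothing else keeps alive.
struct compound_set_sketch {
    std::vector<int> first;
    std::vector<int> second;

    std::shared_ptr<const std::vector<int>> all() const {
        auto merged = std::make_shared<std::vector<int>>(first);
        merged->insert(merged->end(), second.begin(), second.end());
        return merged;
    }
};

inline int sum_all(const compound_set_sketch& set) {
    // Bind the result to a named variable so the list outlives the loop.
    // With `for (int v : *set.all())`, C++20 and earlier destroy the
    // temporary shared_ptr, and with it the list, before the loop body runs.
    auto sstables = set.all();
    assert(sstables);
    int total = 0;
    for (int value : *sstables) {
        total += value;
    }
    return total;
}
```

The init-statement form shown in the commit message does the same thing while limiting the variable's scope to the loop.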

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:12 -03:00
Avi Kivity
5f4bf18387 Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros"
This reverts commit 31909515b3, reversing
changes made to ef97adc72a. It shows many
serious regressions in dtest.

Fixes #8197.
2021-03-02 13:21:22 +02:00
Avi Kivity
31909515b3 Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables into its containers, and erasing them in
the case of a caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.

Fixes #2622

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8111

* github.com:scylladb/scylla:
  sstables: add test for checking the latency of updating the sstable_set in a table
  sstables: move column_family_test class from test/boost to test/lib
  sstables: use fast copying of the sstable_set instead of rebuilding it
  sstables: replace the sstable_set with a versioned structure
  sstables: remove potential ub
  sstables: make sstable_set constructor less error-prone
2021-03-01 14:16:36 +02:00
Tomasz Grabiec
fb1d3fe2cf table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.

Commit 9124a70 introduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.

This could cause the sstable writer to crash, or to generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-22 17:51:00 +02:00
Wojciech Mitros
0feff8712e sstables: use fast copying of the sstable_set instead of rebuilding it
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.

The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Avi Kivity
4082f57edc Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
Refs #6148

Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.

This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments

Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.

Closes #7879

* github.com:scylladb/scylla:
  commitlog: Force earlier cycle/flush iff segment reserve is empty
  commitlog: Make segment allocation wait iff disk usage > max
  commitlog: Do partial (memtable) flushing based on threshold
  commitlog: Make flush threshold configurable
  table: Add a flush RP mark to table, and shortcut if not above
2021-02-08 16:44:05 +02:00
Raphael S. Carvalho
e1261d10f1 table: Avoid useless allocations when updating cache on memtable flush completion
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
2021-02-06 20:03:33 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base class's
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Botond Dénes
080bc2ffec sstables: pass partition_range to create_single_key_sstable_reader()
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.

This in effect reverts 68663d0de.
2021-01-27 17:38:14 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assuming 1TB of sstables per node, 40MB/s per shard, and 32 shards per
node, we can finish loading and streaming 1TB of data in about 13 mins on
each node.

1TB / (40 MB/s per shard * 32 shards) / 60 s per min = ~13 mins
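
The estimate can be reproduced with a few lines of arithmetic. This is only a worked example: the function name is hypothetical, decimal units (1 TB = 1,000,000 MB) are assumed, and the shards are assumed to stream in parallel at the measured rate.

```cpp
#include <cassert>

// Estimated minutes for one node to load and stream `data_mb` megabytes,
// assuming every shard sustains `per_shard_mb_per_s` in parallel.
inline double load_and_stream_minutes(double data_mb, double per_shard_mb_per_s, int shards) {
    assert(per_shard_mb_per_s > 0 && shards > 0);
    const double node_mb_per_s = per_shard_mb_per_s * shards;  // 40 * 32 = 1280 MB/s
    const double seconds = data_mb / node_mb_per_s;            // 1,000,000 / 1280 = 781.25 s
    return seconds / 60.0;                                     // about 13 minutes
}
```

Calling load_and_stream_minutes(1000000.0, 40.0, 32) gives roughly 13.02 minutes, which matches the figure quoted above.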

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes streaming, so no nodetool
cleanup is needed.

The name of this feature, load and stream, follows load and store in the CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Raphael S. Carvalho
00c29e1e24 table: Move notify_bootstrap_or_replace_*() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>
2021-01-17 10:36:13 +02:00
Calle Wilund
c3d95811da table: Add a flush RP mark to table, and shortcut if not above
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.

This is to (in future) support incremental CL flush.
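
A rough sketch of the shortcut, assuming a replay position can be modelled as a single increasing integer and that "not above" means less than or equal to the mark. `rp_sketch` and `flush_mark_sketch` are hypothetical and do not mirror the real table or replay_position types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification: a replay position (RP) is a single increasing
// number rather than Scylla's real replay_position type.
using rp_sketch = std::uint64_t;

class flush_mark_sketch {
    rp_sketch _highest_rp = 0;  // highest RP written to the memtable
    rp_sketch _flush_rp = 0;    // RP up to which data has been flushed
    unsigned _flushes = 0;      // number of flushes actually performed
public:
    void write(rp_sketch rp) {
        if (rp > _highest_rp) {
            _highest_rp = rp;
        }
    }

    // Returns true if a flush ran, false if the request was already covered.
    bool flush(rp_sketch requested) {
        assert(_flush_rp <= _highest_rp);
        if (requested <= _flush_rp) {
            return false;  // an earlier flush already covered this position
        }
        _flush_rp = _highest_rp;
        ++_flushes;
        return true;
    }

    unsigned flushes() const { return _flushes; }
};
```

After a write at position 10 and a flush requested for position 5, a later request for position 7 is skipped because the mark already sits at 10.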
2021-01-05 18:16:09 +00:00
Raphael S. Carvalho
9124a708f1 table: Wire interposer consumer for memtable flush
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has an effect only for TWCS users, as TWCS is the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.

Fixes #4617.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:26:07 -03:00
Raphael S. Carvalho
c926a948e5 table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
This new variant will be needed for interposer consumer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:23:00 -03:00
Raphael S. Carvalho
32acb44fec table: Allow sstable write permit to be shared across monitors
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 14:46:43 -03:00
Raphael S. Carvalho
5519fdba72 table: Extend cache update to operate a memtable split into multiple sstables
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 13:24:10 -03:00
Asias He
84f482bde4 table: Add make_streaming_reader for given sstables set
Add a streaming reader that streams from a given sstables set.

Refs #7831
2020-12-30 08:32:42 +08:00