Now that it returns a future that always waits on
pending flushes there is no point in calling it `request_flush`.
`flush()` is simpler and better describes its function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#8733
If a memtable flush is still pending when we call table::clear(),
we can end up doing a "discard-all" call to commitlog, followed
by a per-segment-count (using rp_set) _later_. This will foobar
our internal usage counts and quite probably cause assertion
failures.
Fixed by always doing per-memtable explicit discard call. But to
ensure this works, since a memtable being flushed remains on
memtable list for a while (why?), we must also ensure we clear
out the rp_set on discard.
Closes#8766
It is currently possible that _memtables->add_memtable()
will throw after _memtables->clear(), leaving the memtables
list completely empty. However, we do rely on always
having at least one allocated in the memtables list
as active_memtable() references a lw_shared_ptr<memtable>
at the back of the memtables vector, and it expected
to always be allocated via add_memtable() upon construction
and after clear().
This change moves the implementation of this convention
to memtable_list::clear() and makes the latter exception safe
by first allocating the to-be-added empty memtable and
only then clearing the vector.
Refs #8749
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210530100232.2104051-1-bhalevy@scylladb.com>
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.
This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for sometime, off-strategy compaction
will be triggered for that table automatically.
Fixes#8677Closes#8678
Currently the pending (memtables) flushes stats are adjusted back
only on success, therefore they will "leak" on error, so move
use a .then_wrapped clause to always update the stats.
Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
"
From now on, offstrategy compaction is triggered on completion of repair-based
removenode. So compaction will no longer act aggressively while removenode
is going on, which helps reducing both latency and operation time.
Refs #5226.
"
* 'offstrategy_removenode' of github.com:raphaelsc/scylla:
repair: Wire offstrategy compaction to repair-based removenode
table: introduce trigger_offstrategy_compaction()
repair/row_level: make operations_supported static const
Make sure to close the querier and subsequently its reader before
destroying it (unless it was moved).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).
Ref #1.
given that resharding is now a synchronous mandatory step, before
table is populated, snapshot() can now get rid of code which takes
into account whether or not a sstable is shared.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210414121549.85858-1-raphaelsc@scylladb.com>
this function will be used on repair-based operation completion,
to notify table about the need to start offstrategy compaction
process on the maintenance sstables produced by the operation.
Function which notifies about bootstrap and replace completion
is changed to use this new function.
Removenode and decommission will reuse this function.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want to migrate `database::mutation_query()` off `mutation_query()`
to use `table::mutation_query()` instead. The reason is the same as for
making `table::query()` standalone: the `mutation_query()`
implementation increasingly became specific to how tables are queried
and is about to became even more specific due to impending changes to
how permits are obtained. As no-one in the codebase is doing generic
mutation queries on generic mutation sources we can just make this a
member of table.
This patch just adds `table::mutation_query()`, no user exists yet.
`table::mutation_query()` is identical to `mutation_query()`, except
that it is a coroutine.
`data_query()` is now just a thin wrapper over
`data_querier::consume_page()`. Furthermore, contrary to the old data
query method, it is not a generic way of querying a mutation source, it
is now closely tied to how we query tables. It does a querier lookup and
save. In the future we plan on tying it even closer to the table in how
permits are obtained. For this reason it is better to just inline it
into the `query()` method which invokes it.
This method is very hard to read or modify in its current form due to
all the continuation-chain boilerplate. Make it a coroutine to
facilitate future changes in the next patches but not just.
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.
We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence aren't accounted by the compaction backlog tracker yet.
Refs #5226.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
effects on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.
Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now own, _sstables becomes the compound set, and _main_sstables refer
only to the main sstables of the table. In the near future, maintenance
set will be introduced and will also be managed by the compound set.
So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.
By storing compound set in _sstables, functions which used _sstables for
creating reader, computing statistics, etc, will not have to be changed
when we introduce the maintenance set, so code change is a lot minimized
by this approach.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.
this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
so even range-based loops on all() have to keep a ref to the returned
list, to avoid the list from being prematurely destroyed.
so the following code
for (auto& sst : *sstable_set.all()) { ...}
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
Fixes#2622
Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.comCloses#8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to intrpret mutation
fragments produced by the reader.
Commit 9124a70 intorduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.
This could lead to the sstable write to crash, or generate a corrupted
sstable.
Fixes#7994
Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.
The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parmeters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.
This is effect reverts 68663d0de.
storage_service: Introduce load_and_stream
=== Introduction ===
This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.
From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.
This can make restores and migrations much easier.
=== Performance ===
I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB
Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.
1TB / 40 MB per shard * 32 shard / 60 s = 13 mins
=== Tests ===
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.
=== Usage ===
curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true
=== Notes ===
Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.
The name of this feature load and stream follows load and store in CPU world.
Fixes#7831Closes#7846
* github.com:scylladb/scylla:
storage_service: Introduce load_and_stream
distributed_loader: Add get_sstables_from_upload_dir
table: Add make_streaming_reader for given sstables set
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.
This is to (in future) support incremental CL flush.
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has effect only for TWCS users as TWCS it's the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.
Fixes#4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>