"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.
Tests:
- unit: release mode
- dtest: resharding_test.py"
* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
Remove SSTable's atomic deletion manager
Stop using SSTable's atomic deletion manager
database: split column_family::rebuild_sstable_list
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as const folding in some cases the difference in performance
was significant enough to force us to add [[gnu::always_inline]] attribute in
numerous places. However, this is neither a partical nor an elegant solution.
A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.
Apart from making the compiler more eager to inline functions bumping the
optimisation level to -O3 also seems to have a positive impact on the
performance.
Fixes#1644.
Tests: unit-test (release)
Performance tested with gcc 7.3.
Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.
./before ./after diff
read 338662.12 405377.80 19.7%
write 387378.89 466744.15 20.5%
Microbenchmarks
single run duration: 1.000s
number of runs: 5
BEFORE
test iterations median mad min max
combined.one_row 858933 536.389ns 0.819ns 534.823ns 537.208ns
combined.single_active 8469 77.131us 11.000ns 77.118us 77.145us
combined.many_overlapping 1199 664.105us 160.807ns 663.818us 668.527us
combined.disjoint_interleaved 8100 75.522us 22.254ns 75.500us 75.732us
combined.disjoint_ranges 8288 72.580us 10.571ns 72.568us 72.599us
memtable.one_partition_one_row 1216233 825.581ns 0.446ns 821.450ns 826.027ns
memtable.one_partition_many_rows 127336 7.855us 2.153ns 7.853us 7.898us
memtable.many_partitions_one_row 57919 17.356us 6.028ns 17.259us 17.362us
memtable.many_partitions_many_rows 4751 210.496us 102.339ns 210.393us 211.188us
AFTER
test iterations median mad min max
combined.one_row 1002321 450.292ns 0.313ns 447.202ns 450.605ns
combined.single_active 9605 67.086us 8.620ns 67.073us 67.115us
combined.many_overlapping 1476 519.554us 5.334ns 519.549us 519.953us
combined.disjoint_interleaved 9280 64.363us 5.328ns 64.335us 64.369us
combined.disjoint_ranges 9481 61.893us 3.620ns 61.885us 61.903us
memtable.one_partition_one_row 1432668 699.775ns 0.106ns 696.023ns 699.918ns
memtable.one_partition_many_rows 153692 6.536us 6.885ns 6.501us 6.543us
memtable.many_partitions_one_row 63319 15.879us 5.080ns 15.793us 15.884us
memtable.many_partitions_many_rows 5659 176.717us 66.770ns 176.650us 177.778us"
* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
configure.py: set optimisation level to -O3
configure.py: set inline-unit-growth to 300
configure.py: flag_supported: support flags with spaces
configure.py: rename warning_supported to flag_supported
configure.py: pass optimisation flags to seastar/configure.py
cql3/select_statement: do not capture stack variables by reference
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
in his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.
What you see here is the result of merging that work.
After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.
We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.
* git@github.com:glommer/scylla.git cpusched-v7:
Avi Kivity (4):
database, sstables, compaction: convert use of thread_scheduling_group
to seastar cpu scheduler
memtable, database: make memtable::clear_gently() inherit
scheduling_group
config: mark background_writer_scheduling_quota as Unused
database: place data_query execution stage into scheduling_group
Glauber Costa (9):
database, main: set up scheduling_groups for our main tasks
row_cache: actually use the scheduling group for update_cache
allow update_cache and clear_gently to use the entire task quota.
database: remove cpu_flush_quota metric
controllers: retire auto_adjust_flush_quota
controllers: allow memtable I/O controller to have shares statically
set
controllers: update control points for memtable I/O controller
controllers: allow a static priority to override the controller output
controllers: unify the I/O and CPU controllers
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.
* seastar 6d02263...2b0a81d (7):
> configure.py: add -Wno-stringop-overflow
> configure.py: add --optflags for specifying optimisation flags
> build: add protobuf-compiler to docker dev image
> build: update docker builder to newer Fedora
> json_element: stream_object to get its parameter by value
> json_element: stream range object
> build: add yaml-cpp-devel installation to Dockerfile
The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.
Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one. We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.
In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.
That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.
The end goal is to fully integrate them.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It no longer makes sense now that we have the full scheduler +
controllers. In its lieu, we will provide an option to statically set
the controller's shares as a safe guard against us getting this wrong.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have had a quota of partitions to process in clear_gently /
update_cache, so that we don't overwork. However, with those things now
being in their own task group there is no harm in allowing it to run
until we reach a natural preemption point.
While we are at it, clear_gently did not check for need_preempt()
before, so this patch fixes it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.
This patch fixes that and gets rid of the old group just in case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.
Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.
The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.
Comments from Glauber:
This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:
1) A bug/regression is fixed with the boundary calculations for the
memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
themselves are run in the compaction scheduling group. Having that
working is what changes this patch the most: we now store the
scheduling group in the compaction manager and let the compaction
manager itself enforce the scheduling group.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.
The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.
Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.
Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.
So I am proactively chaning min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
Since we unconditionally running blkdiscard on disks, we may get ioctl error
message on some disks which does not support TRIM.
This can be ignore but it's bad UX, so let's skip running blkdiscard when TRIM
is not supported on the disk.
Fixes#2774
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517992904-13838-1-git-send-email-syuu@scylladb.com>
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
Tests: unit (release)"
* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
tests: mvcc: Add test for eviction with non-evictable snapshots
mutation_partition: Define + operator on tombstones
tests: mvcc: Check that partition is fully discontinuous after eviction
tests: row_cache: Add test for memtable readers surviving flush and eviction
memtable: Make printable
mvcc: Take partition_entry by const ref in operator<<()
mvcc: Do not evict from non-evictable snapshots
mvcc: Drop unnecessary assignment to partition_snapshot::_version
tests: Use partition_entry::make_evictable() where appropriate
mvcc: Encapsulate construction of evictable entries
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
merge_partition_versions() is responsible for merging versions
unpinned by the current snapshot. If that fails, we don't need to set
_version back since versions must be still referenced by someone else,
this snapshot is not a unique owner.
This change makes it easier to add tracking of evictability.
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.
So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.
We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and feeded to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.
Tests: release mode.
Fixes#3148.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
* seastar 21badbd...6d02263 (4):
> build: detect name of ninja executable
> queue: pop_eventually/push_eventually should throw when called after abort
> build: compile libfmt out-of-line
> core/gate: Ensure with_gate leaves gate on exception
We uses AmbientCapabilities directive on systemd unit, but it does not work
on older kernel, causes following error:
"systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument"
It only works on kernel-3.10.0-514 == CentOS7.3 or later, block installing rpm
to prevent the error.
Fixes#3176
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517822764-2684-1-git-send-email-syuu@scylladb.com>
* seastar 19efbd9...21badbd (4):
> reactor: change adjustment method for tasks becoming active
> Merge 'Update ARM port' from Avi
> http: Do not wait for close connection on stop if listen did not completed
> core/future-util: Don't allow rvalues in do_for_each()