This is the last missing tracker among the major strategies. After
this, only DTCS is left.
To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:
Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,
where:
* the fan_out is the number of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Care is taken so that the backlog does not jump when a new level has
just been created.
Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.
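
The formula above can be sketched as follows. This is a minimal
illustration, not the tracker itself: lcs_backlog_sketch and its
parameters are hypothetical names, and it assumes the (max_levels - L)
factor is measured against the highest populated level so that data
already in the last level contributes zero, as the zero-backlog
definition requires. The STCS contribution for L0 and the smoothing for
newly created levels are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// level_sizes[L] holds sizeof(L): the total bytes currently stored in level L.
// Assumption: the (max_levels - L) factor is taken relative to the highest
// populated level, so data already in the last level adds no backlog.
uint64_t lcs_backlog_sketch(const std::vector<uint64_t>& level_sizes,
                            uint64_t fan_out = 10) {
    // Find the highest level that currently holds data.
    std::size_t last = 0;
    for (std::size_t l = 0; l < level_sizes.size(); ++l) {
        if (level_sizes[l] > 0) {
            last = l;
        }
    }
    // Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out
    uint64_t backlog = 0;
    for (std::size_t l = 0; l < last; ++l) {
        backlog += level_sizes[l] * (last - l) * fan_out;
    }
    return backlog;
}
```

With 100 bytes in L1 and 1000 bytes in L2, only L1 contributes:
100 * (2 - 1) * 10 = 1000.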
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which makes it hard for us to use it from
other properties.
We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.
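
A minimal sketch of the helper-function pattern, under stated
assumptions: the class name, the "sstable_size_in_mb" option key, the
default of 160 and the positivity check are illustrative and are not
taken from this patch. The point is only that a static helper lets the
value be computed in the member initializer list, so later members can
depend on it.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

class leveled_options_sketch {
    using options_map = std::map<std::string, std::string>;

    // Declared first, so it is initialized before the members that use it.
    int _max_sstable_size_in_mb;
    uint64_t _max_sstable_size_in_bytes;

    // Hypothetical helper: parse the option and check its bounds in one place.
    static int calculate_max_sstable_size_in_mb(const options_map& options) {
        int size_mb = 160; // assumed default, for illustration only
        auto it = options.find("sstable_size_in_mb"); // assumed option key
        if (it != options.end()) {
            size_mb = std::stoi(it->second);
        }
        if (size_mb <= 0) { // hypothetical bound
            throw std::invalid_argument("sstable_size_in_mb must be positive");
        }
        return size_mb;
    }

public:
    explicit leveled_options_sketch(const options_map& options)
        : _max_sstable_size_in_mb(calculate_max_sstable_size_in_mb(options))
        , _max_sstable_size_in_bytes(uint64_t(_max_sstable_size_in_mb) * 1024 * 1024)
    {}

    int max_sstable_size_in_mb() const { return _max_sstable_size_in_mb; }
    uint64_t max_sstable_size_in_bytes() const { return _max_sstable_size_in_bytes; }
};
```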
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.
This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.
We have discussed ways to fix this by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplification and we would
need to carefully validate that.
For now I will propose a simpler change that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we use
today, which should help make sure we are not compacting high levels for
no reason; but if there is nothing to do, use the idle time to push data
to higher levels. As an added benefit, old data that is in the higher
levels can also be compacted away faster.
With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.
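
The two-step selection described above might look roughly like this.
It is a simplified sketch: pick_level_sketch is a hypothetical name,
the "regular criteria" are reduced to a plain size-over-maximum check,
and the choice of the lowest populated level during idle time is an
assumption made for illustration. Both vectors are assumed to have one
entry per level.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the level whose data should be compacted into the next level,
// or std::nullopt when there is nothing to do.
std::optional<std::size_t> pick_level_sketch(
        const std::vector<uint64_t>& level_sizes,
        const std::vector<uint64_t>& max_level_sizes) {
    // Step 1: regular criteria (simplified): a level over its maximum wins.
    for (std::size_t l = 0; l < level_sizes.size(); ++l) {
        if (level_sizes[l] > max_level_sizes[l]) {
            return l;
        }
    }
    // Step 2: nothing to do, so use the idle time. Find the highest
    // populated level and push up data that still sits below it.
    std::size_t last = 0;
    for (std::size_t l = 0; l < level_sizes.size(); ++l) {
        if (level_sizes[l] > 0) {
            last = l;
        }
    }
    for (std::size_t l = 0; l < last; ++l) {
        if (level_sizes[l] > 0) {
            return l;
        }
    }
    return std::nullopt; // all data already lives in the last level
}
```

Under constant writes step 1 keeps finding candidates, so step 2 is
only reached when the system is idle.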
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.
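
As a sketch of what such a constant enables (the names leveled_fan_out
and max_bytes_for_level_sketch are hypothetical, and any special
treatment of L0 is ignored), a per-level capacity can be derived from
it instead of hard-coding 10:

```cpp
#include <cstdint>

// Growth factor between consecutive levels (name is illustrative).
static constexpr uint64_t leveled_fan_out = 10;

// Hypothetical helper: each level holds fan_out times more than the previous one.
constexpr uint64_t max_bytes_for_level_sketch(unsigned level,
                                              uint64_t max_sstable_size_in_bytes) {
    uint64_t bytes = max_sstable_size_in_bytes;
    for (unsigned i = 0; i < level; ++i) {
        bytes *= leveled_fan_out;
    }
    return bytes;
}
```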
Signed-off-by: Glauber Costa <glauber@scylladb.com>
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.
The main changes in the patch:
* Move collectd-specific logic inside collectd.
* Add support for help information.
* Add command line options to configure the Prometheus endpoint and to
enable collectd.
* Add a prometheus class that collects information from Prometheus.
Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.
Fixes #3471
Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.
Fixes #3443
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.
So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
"
Add handling for static rows and tests for it.
"
* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: Add test_uncompressed_compound_static_row_read
sstable_3_x_test: add test_uncompressed_static_row_read
flat_mutation_reader_assertions: improve static row assertions
data_consume_rows_context_m: Implement support for static rows
mp_row_consumer_m: Implement support for static rows
mp_row_consumer_m: Extract fill_cells
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:
(1) dropping partition entries from cache or memtables does not defer
(2) dropping partition versions abandoned by detached snapshots does not defer
(3) merging of partition versions when snapshots go away does not defer
(4) cache update from memtable processes partition entries without deferring (#2578)
(5) partition entries are upgraded to new schema atomically
This series fixes problems (1), (2) and (4), but not (3) and (5).
(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.
(3) and (5) are not solved yet.
(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of partition. In order to preserve update
atomicity on partition level as perceived by reads, when update starts we
create a snapshot to the current version of partition and process memtable
entry by inserting data into a separate partition version. This way if upgrade
defers in the middle of partition reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.
Remaining work:
- Solving problem (3). I think the approach to take here would be to
move the task of merging versions to the background, maybe into mutation_cleaner.
- Merging range tombstones incrementally.
Performance
===========
Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.
For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.
For small partition case without clustering columns we see no degradation.
For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.
For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.
Below you can see full statistics for cache update run time:
=== Small partitions, no overwrites:
Before:
avg = 433.965155
stdev = 35.958024
min = 340.093201
max = 468.564514
After:
avg = 436.929447 (+1%)
stdev = 37.130237
min = 349.410339
max = 489.953400
=== Small partition with a few rows:
Before:
avg = 315.379316
stdev = 30.059120
min = 240.340561
max = 342.408295
After:
avg = 407.232691 (+30%)
stdev = 53.918717
min = 269.514648
max = 444.846649
=== Large partition, lots of small rows:
Before:
avg = 412.870689
stdev = 227.411317
min = 286.990631
max = 1263.417847
After:
avg = 124.351705 (-70%)
stdev = 4.705762
min = 110.063255
max = 129.643387
=== Large partition, lots of range tombstones:
Before:
avg = 601.172644
stdev = 121.376866
min = 223.502136
max = 874.111572
After:
avg = 695.627588 (+15%)
stdev = 135.057004
min = 337.173950
max = 784.838745
"
* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
mvcc: Use small_vector<> in partition_snapshot_row_cursor
utils: Extract small_vector.hh
mvcc: Erase rows gradually in apply_to_incomplete()
mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
cache: real_dirty_memory_accounter: Move unpinning out of the hot path
mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
mutation_partition: Reduce row lookups in apply_monotonically()
cache: Release dirty memory with row granularity
cache: Defer during partition merging
mvcc: partition_snapshot_row_cursor: Introduce consume_row()
mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
mvcc: Make apply_to_incomplete() work with attached versions
cache: Propagate phase to apply_to_incomplete()
cache: Prepare for incremental apply_to_incomplete()
Introduce a coroutine wrapper
tests: mvcc: Encapsulate memory management details
tests: cache: Take into account that update() may defer
cache: real_dirty_memory_accounter: Allow construction without memtable
cache: Extract real_dirty_memory_accounter
mvcc: Destroy memtable partition versions gently
memtable: Destroy partitions incrementally from clear_gently()
mvcc: Remove rows from tracker gently
cache: Destroy partition versions incrementally
Introduce mutation_cleaner
mvcc: Introduce partition_version_list
mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
database: Add API for incremental clearing of partition entries
cache: Define trivial methods inline
tests: Improve perf_row_cache_update
mutation_reader: Make empty mutation source advertize no partitions
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.
This change greatly reduces the number of lookups per row during version
merging of large partitions in time-series workloads.
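
The idea can be illustrated with a standard ordered container. This is
a sketch of the technique only, using std::set<int> as a stand-in for
the target row tree; merge_sorted_sketch is a hypothetical name and the
real code operates on row entries, not integers. A tree lookup is
performed only when the cached successor is no longer the first element
at or after the key being merged.

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Merges monotonically increasing keys into target.
// Returns how many tree lookups (lower_bound calls) were needed.
std::size_t merge_sorted_sketch(std::set<int>& target,
                                const std::vector<int>& sorted_src) {
    std::size_t lookups = 0;
    auto succ = target.end();
    bool have_succ = false;
    for (int key : sorted_src) {
        // The cached successor is reusable if it is still the first element
        // that is >= key, i.e. everything before it is smaller than key.
        bool valid = have_succ
            && (succ == target.end() || *succ >= key)
            && (succ == target.begin() || *std::prev(succ) < key);
        if (!valid) {
            succ = target.lower_bound(key);
            ++lookups;
            have_succ = true;
        }
        if (succ == target.end() || *succ != key) {
            // Hinted insert right before the successor: no tree search needed.
            target.insert(succ, key);
        }
    }
    return lookups;
}
```

For a stride of new keys that all fall between two existing entries,
only the first key of the stride pays for a lookup.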
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
Represents a deferring operation which defers cooperatively with the caller.
The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.
This allows the caller to:
1) execute some post-defer and pre-resume actions atomically
2) have control over when the operation is resumed and in which context,
in particular the caller can cancel the operation at deferring points.
It will be used to implement deferring partition_version::apply_to_incomplete().
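
A minimal, self-contained sketch of such a wrapper and of a caller
driving it. The names resumable_sketch, make_summing_op and drive are
hypothetical, and the local stop_iteration enum is only a stand-in for
the real type; the actual wrapper in this patch may differ.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

enum class stop_iteration { no, yes }; // stand-in for the real type

// Wraps a resumable function: each run() performs one slice of work.
class resumable_sketch {
    std::function<stop_iteration()> _fn;
public:
    explicit resumable_sketch(std::function<stop_iteration()> fn)
        : _fn(std::move(fn)) {}
    // Returns stop_iteration::no while more work remains.
    stop_iteration run() { return _fn(); }
};

// Example operation: adds 0..n-1 into total, at most batch items per run().
resumable_sketch make_summing_op(std::size_t n, std::size_t batch,
                                 std::size_t& total) {
    std::size_t next = 0;
    return resumable_sketch([=, &total]() mutable {
        for (std::size_t i = 0; i < batch && next < n; ++i) {
            total += next++;
        }
        return next == n ? stop_iteration::yes : stop_iteration::no;
    });
}

// The caller decides when to resume and can act at every deferring point.
// Returns the number of run() calls that were needed.
std::size_t drive(resumable_sketch& op,
                  const std::function<void()>& between_runs) {
    std::size_t resumes = 0;
    while (true) {
        ++resumes;
        if (op.run() == stop_iteration::yes) {
            return resumes;
        }
        between_runs(); // post-defer / pre-resume actions go here
    }
}
```

Because run() returns a plain value rather than a future, the caller
could also stop calling it, which is how cancellation at a deferring
point would be expressed in this sketch.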
Currently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself because this requires the region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_container and mvcc_partition classes.
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
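
To illustrate the ownership flow only, here is a toy cleaner with
hypothetical names (cleaner_sketch, version_sketch, destroy_later,
clear_some). It queues garbage versions, frees a bounded number of rows
per call, and can absorb another cleaner's queue, which is the role
merging plays when a memtable is merged into cache. It does not model
LSA regions or the cache_tracker.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>
#include <vector>

struct version_sketch {
    std::vector<int> rows; // stand-in for the rows of a partition version
};

class cleaner_sketch {
    std::list<std::unique_ptr<version_sketch>> _garbage;
public:
    // Queue a version for gradual destruction instead of freeing it now.
    void destroy_later(std::unique_ptr<version_sketch> v) {
        _garbage.push_back(std::move(v));
    }
    // Take over everything queued in other (other becomes empty).
    void merge(cleaner_sketch& other) {
        _garbage.splice(_garbage.end(), other._garbage);
    }
    // Free at most max_rows rows; returns true once nothing is left.
    bool clear_some(std::size_t max_rows) {
        while (max_rows > 0 && !_garbage.empty()) {
            auto& rows = _garbage.front()->rows;
            std::size_t n = std::min(max_rows, rows.size());
            rows.resize(rows.size() - n);
            max_rows -= n;
            if (rows.empty()) {
                _garbage.pop_front();
            }
        }
        return _garbage.empty();
    }
    bool empty() const { return _garbage.empty(); }
};
```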
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.
Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.
Refs #3289.
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
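
A self-contained sketch of the contract, with assumptions stated:
partition_sketch, rows_per_step and the local stop_iteration enum are
illustrative stand-ins, and a plain loop takes the place of repeat().
Each call removes a bounded number of rows and reports whether the
object is now empty.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

enum class stop_iteration { no, yes }; // stand-in for the real type

class partition_sketch {
    std::map<int, std::string> _rows;
    static constexpr std::size_t rows_per_step = 2; // hypothetical batch size
public:
    void add(int key, std::string value) {
        _rows.emplace(key, std::move(value));
    }
    std::size_t size() const { return _rows.size(); }

    // Removes a bounded number of rows per call. Returns stop_iteration::yes
    // once the object is empty and can be destroyed quickly.
    stop_iteration clear_gently() noexcept {
        std::size_t removed = 0;
        while (!_rows.empty() && removed < rows_per_step) {
            _rows.erase(_rows.begin());
            ++removed;
        }
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

A caller that cannot defer, such as a memory reclaimer, can invoke
clear_gently() once per reclaim step, while a deferring caller keeps
calling it until it returns stop_iteration::yes.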
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"
* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
db/view: Remove ifdef'd Java code
db/view: Ignore scenario where base replica hasn't joined the ring
db/view: Handle case when base has no paired view replica
"
Add handling for clustering columns and tests for it.
"
* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
Add test_uncompressed_compound_ck_read for SSTables 3.x
Add test_uncompressed_simple_read for SSTables 3.x
Implement reading clustering key from SSTables 3.x
column_translation: cache fixed value lengths for ck
data_consume_rows_context_m: use cached fixed column value lenghts
column_translation: store fix lengths of column values
consume_row_start: change type of clustering key
Rename ROW_BODY state to CLUSTERING_ROW