That's gonna be useful to quickly determine if it's worth resharding
a column family.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A large token span sstable may find its way into high level due to resharding,
which means the strategy invariant is broken. The invariant is restored by
compacting first set of overlapping sstables, meaning that the restoration
is done incrementally for multiple overlapping sets.
Invariant is restored by regular compaction after resharding puts new unshared
sstables into their original level, where level > 0.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.
This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's possible by shard A explicitly saying that sstable is created
for shard B. If we don't do that, sharding metadata isn't correct,
and consequently sstable will report wrong owners.
We'll need this for resharding which will create sstables for all
shards that own the shared sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compacting_sstable_writer only allowed one writer so far, but we will need
multiple ones for resharding.
It's done by moving writer management to compaction.
finish_sstable_writer() is added for compaction impl to stop all writers,
whereas stop_sstable_writer() will only stop current writer (needed when
current sstable reaches max limit size for example).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
No need for compacting_sstable_writer to store items that are available
in compaction class. Also, that's a step towards supporting multiple
writers for compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compact_sstables() supported both regular and cleanup compaction,
but with lots of conditions that made it ugly and hard to extend.
In the future, we want to introduce a new type of compaction for
resharding that will create one sstable for every shard owning
the sstable(s) given as input. That will be easier now.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* seastar 2eec212...194d80f (4):
> removing the collectd tests
> fix fstream metrics reporting.
> do_for_each: Make it check for need preempt
> core/sharded: introduce copy method to foreign_ptr
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634."
* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
lsa: Reduce reclamation latency
tests: Add test for log_histogram
log_histogram: Allow non-power-of-two minimum values
lsa: Use regular compaction threshold in on-idle compaction
tests: row_cache_test: Induce update failure more reliably
lsa: Add getter for region's eviction function
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.
Therefore, change log_histogram to support arbitrary minimum base.
bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
Idle-time compaction should not produce not-compactible segments
becuase that means we would have to evict a lot when we finally need
to reclaim some memory, so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.
Refs #1634.
After changing region evicitability condition to be less strict, cache
update stopped failing because reclamation was able to compact dense
region. Induce failure by installing evictor which refuses to evict
from cache beyond few elements.
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.
Some highlights:
One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.
Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
form the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after we determine the key that reader will be put behind
other readers, which will exhaust the query limit first.
Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."
* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
sstables: Remove unused code
sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
sstables: mutation_reader: Use index_reader for single-partition reads
sstables: mutation_reader: Add trace-level logging
sstables: mutation_reader: Move partition reading code to sstable_data_source
sstables: mutation_reader: Move definitions out of the class body
sstables: Move binary_search() to a header
database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
sstables: index_reader: Introduce advance_to_next_partition()
sstables: index_reader: Introduce advance_and_check_if_present()
sstables: index_reader: Introduce advance_past()
sstables: index_reader: Make copyable
sstables: index_reader: Optimize advancing to extreme positions
sstables: index_reader: Keep two last pages alive
dht: ring_position_view: Add key getter
dht: ring_position_view: Add constructor and factory from ring_position_view
sstables: mutation_reader: Advance to next partition using index in some cases
sstables: index_reader: Expose access to partition key and tombstone
sstables: index_reader: Introduce promoted_index_view
sstables: mutation_reader: Move _index_in_current to sstable_data_source
...
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().
perf_fast_forward, rows: 1000000, value size: 100
Before:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.002182 2 916 3 152 2 0 0 1 1 88.1%
After:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.000758 2 2639 3 152 2 0 0 1 1 48.6%
This is also a cleanup, a step towards converting all code to use the
index_reader.
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and they became obsolete. The
thing worked because partition.cc also instantiated it with the right
type. But after that code will be removed, it no longer would, and we
would get a linker error. To avoid such problems, define
binary_search() in a header.
The idea behind caching is that when we have two index readers where
one is catching up with the other, each page will be read only
once. Currently that's not always the case. There is a case when
advance_to() may need to read two pages. That's when the target
position is not found in the first page as determined by the summary
index. The second reader which catches up would have to read the first
page as well, but it would not be in cache any more. To avoid this
extra I/O let's keep a reference to the two last pages touched by the
index.
To produce a streamed_mutation for the next partition, we need to read
its key and the tombstone. Currently we always do that by consuming
the partition header from the data file. In some cases that may cause
unnecessary IO.
It's better to obtain partition information from the index if we
already have it. We can save on IO if the user will skip past the
front of partition immediately after.
It is also better to pay the cost of reading the index if we know that
we will need to use the index anyway soon. This patch predicts that by
checking if there are any clustering restrictions. If there are any,
we will almost surely need_skip() and use the index anyway.
This change also lays the ground for unification of multi and single
partiton queries without loss of performance.
sstable_data_source holds a shared state between mutation_reader and
streamed_mutation for sstables. The information whether index is in
current partition will have to be accessed by both in the following
patches.
Before the change, the following scenario was happening:
1) we try to skip based on clustering restrictions
2) we find the page and fast forward to it, recording walker's
lower bound counter
3) we read the first fragment, it's not a tombstone, so we reset the walker,
and its lower bound counter too
4) the fragment is not in range (the range starts in the middle of the page)
5) needs_skip() is true, we redo the index lookup, which wastes some CPU
This change fixes the problem by avoiding resetting the walker. We can
do that because leading tombstones are checked with a non-mutable
contains_tombstone()
Allows detecting changes of lower_bound().
Result of advance_to() is not enough. When we get false from
advance_to() twice in a row, lower bound may or may not have changed.