LCS will choose its candidate by starting from highest level and
getting sstable which has highest droppable tombstone ratio.
Unlike STCS which needs to choose oldest sstable from biggest tier,
LCS can choose the one with highest d__t__r because sstables in
a given level don't overlap.
Sstable picked up for tombstone removal compaction won't be demoted
or promoted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Larger sstables are hard to find sstable peers and therefore are
left uncompacted for a long time. Expired data and tombstones which
can be purged will waste disk space meanwhile.
sstable tracks droppable tombstone from which ratio can be calculated.
If ratio is greater than threshold (0.2 by default), sstable will
be eligible for compaction. Oldest sstables from biggest tiers are
preferrable because droppable data in them are more likely to satisfy
the conditions for purge, like not shadowing data in another sstable.
Subsequent patches will add support in leveled and date tiered
strategies.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Function used to estimate ratio of droppable tombstone.
A tombstone is considered droppable for cells expired before
gc_before and regular tombstones older than gc_before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Find bad histogram which had incorrect elements merged due to use of
unordered map. The keys will be unordered. Histogram which size is
less than max allowed will be correct because no entries needed to be
merged, so we can avoid discarding those.
This is important because histogram for tombstone will be used to
estimate droppable tombstone ratio. If it's incorrectly high for many
of existing sstables, we will needlessly compact lots of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This bug was introduced when converting java code. Return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so map is used instead
by histogram.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.
Coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tomstone compaction before any data is actually
expired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8, this
change extends support to Ubuntu non-LTS versions / Debian non-stable versions.
Note that this is unofficial support, users should build the package for these
distributions theirselves.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair perforamnce with multiple DC.
Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
Fixes#2528.
* tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev:
gossip: Use vector for _live_endpoints
gossip: Talk to more live nodes in each gossip round
In large clusters with multiple DC deployment, it is observed that it
takes long delay for gossip update to disseminate in the cluster.
To speed up, talk to more live nodes in each gossip round.
Fixes#2528
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().
Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.
With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch returns this optimization, despite this meaning that we
blatently ignore the fwd_mr flag in that case.
Fixes#2524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.
The 10MB threshold for partition size in cache is lifted.
Known issues:
- There is no partial eviction yet, whole partitions are still evicted,
and partition snapshots held by active reads are not evictable at all
- Information about range continuity is not recorded if that
would require inserting a dummy entry, or if previous entry
doesn't belong to the latest snapshot
- Cache update after memtable flush happening concurrently with reads
may inhibit that reads' ability to populate cache (new issue)
- Cache update from flushed memtables has partition granularity,
so may cause latency problems with large partition
- Schema is still tracked per-partition, so after schema changes
reads may induce high latency due to whole partition needing
to be converted atomically
- Range tombstones are repeated in the stream for every range between
cache entries they cover (new issue)
- Populating scans for both small and large partitions (perf_fast_forward)
experienced a 40% reduction of throughput, CPU bound
How was this tested:
- test.py --mode release
- row_cache_stress_test -c1 -m1G
- perf_fast_forward, passes except for the test case checking range continuity population
which would require inserting a dummy entry (mentioned above)
- perf_simple_query (-c1 -m1G --duration 32):
before: 90k [ops/s] stdev: 4k [ops/s]
after: 94k [ops/s] stdev: 2k [ops/s]"
* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
tests: Introduce row_cache_stress_test
utils: Add helpers for dealing with nonwrapping_range<int>
tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
tests: simple_schema: Accept value by reference
tests: simple_schema: Make add_row() accept optional timestamp
tests: simple_schema: Make new_timestamp() public
tests: simple_schema: Introduce make_ckeys()
tests: simple_schema: Introduce get_value(const clustered_row&) helper
tests: simple_schema: Fix comment
tests: simple_schema: Add missing include
row_cache: Introduce evict()
tests: Add cache_streamed_mutation_test
tests: mutation_assertions: Allow expecting fragments
mutation_fragment: Implement equality check
tests: row_cache: Add test for population of random partitions
tests: row_cache: Add test for partition tombstone population
tests: row_cache: Test reading randomly populated partition
tests: row_cache: Add test_single_partition_update()
tests: row_cache: Add test_scan_with_partial_partitions
...
Runs readers, updates and eviction concurrently and verifies the
following property of reads:
- reads see all past writes
- reads see no partial writes within a single partition
[tgrabiec:
- extracted from a larger commit
- removed coupling with how cache_streamed_mutation is created (the
code went out of sync), used more stable make_reader(). it's simpler too.
- replaced false/true literals with is_continuous/is_dummy where appropraite
- dropped tests for cache::underlying (class is gone)
- reused streamed_mutation_assertions, it has better error messages
- fixed the tests to not create tombstones with missing timestamps
- relaxed range tombstone assertions to only check information relevant for the query range
- print cache on failure for improved debuggability
]