That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tomstone compaction before any data is actually
expired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8, this
change extends support to Ubuntu non-LTS versions / Debian non-stable versions.
Note that this is unofficial support, users should build the package for these
distributions theirselves.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair perforamnce with multiple DC.
Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
Fixes#2528.
* tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev:
gossip: Use vector for _live_endpoints
gossip: Talk to more live nodes in each gossip round
In large clusters with multiple DC deployment, it is observed that it
takes long delay for gossip update to disseminate in the cluster.
To speed up, talk to more live nodes in each gossip round.
Fixes#2528
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().
Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.
With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch returns this optimization, despite this meaning that we
blatently ignore the fwd_mr flag in that case.
Fixes#2524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.
The 10MB threshold for partition size in cache is lifted.
Known issues:
- There is no partial eviction yet, whole partitions are still evicted,
and partition snapshots held by active reads are not evictable at all
- Information about range continuity is not recorded if that
would require inserting a dummy entry, or if previous entry
doesn't belong to the latest snapshot
- Cache update after memtable flush happening concurrently with reads
may inhibit that reads' ability to populate cache (new issue)
- Cache update from flushed memtables has partition granularity,
so may cause latency problems with large partition
- Schema is still tracked per-partition, so after schema changes
reads may induce high latency due to whole partition needing
to be converted atomically
- Range tombstones are repeated in the stream for every range between
cache entries they cover (new issue)
- Populating scans for both small and large partitions (perf_fast_forward)
experienced a 40% reduction of throughput, CPU bound
How was this tested:
- test.py --mode release
- row_cache_stress_test -c1 -m1G
- perf_fast_forward, passes except for the test case checking range continuity population
which would require inserting a dummy entry (mentioned above)
- perf_simple_query (-c1 -m1G --duration 32):
before: 90k [ops/s] stdev: 4k [ops/s]
after: 94k [ops/s] stdev: 2k [ops/s]"
* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
tests: Introduce row_cache_stress_test
utils: Add helpers for dealing with nonwrapping_range<int>
tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
tests: simple_schema: Accept value by reference
tests: simple_schema: Make add_row() accept optional timestamp
tests: simple_schema: Make new_timestamp() public
tests: simple_schema: Introduce make_ckeys()
tests: simple_schema: Introduce get_value(const clustered_row&) helper
tests: simple_schema: Fix comment
tests: simple_schema: Add missing include
row_cache: Introduce evict()
tests: Add cache_streamed_mutation_test
tests: mutation_assertions: Allow expecting fragments
mutation_fragment: Implement equality check
tests: row_cache: Add test for population of random partitions
tests: row_cache: Add test for partition tombstone population
tests: row_cache: Test reading randomly populated partition
tests: row_cache: Add test_single_partition_update()
tests: row_cache: Add test_scan_with_partial_partitions
...
Runs readers, updates and eviction concurrently and verifies the
following property of reads:
- reads see all past writes
- reads see no partial writes within a single partition
[tgrabiec:
- extracted from a larger commit
- removed coupling with how cache_streamed_mutation is created (the
code went out of sync), used more stable make_reader(). it's simpler too.
- replaced false/true literals with is_continuous/is_dummy where appropraite
- dropped tests for cache::underlying (class is gone)
- reused streamed_mutation_assertions, it has better error messages
- fixed the tests to not create tombstones with missing timestamps
- relaxed range tombstone assertions to only check information relevant for the query range
- print cache on failure for improved debuggability
]
Those methods first create a neutral mutation_partition, and left-fold
it with the versions. The problem is that there is no neutral element
for static row continuity, the flag from the first addend always
wins. We have to copy the flag from the first version to preserve
the logical value.