The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.
Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
(cherry picked from commit d97eebe82d)
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes#3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b0b57b8143)
The labels in database active_reads metrics where not define correctly.
Label should be created so it will be possible to select based on their
value.
The current implementation define a label "class" with three instances:
user, streaming, system.
Fixes: #2770
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
(cherry picked from commit a0a1961b6d)
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
Tests: unit (release)"
* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
tests: mvcc: Add test for eviction with non-evictable snapshots
mutation_partition: Define + operator on tombstones
tests: mvcc: Check that partition is fully discontinuous after eviction
tests: row_cache: Add test for memtable readers surviving flush and eviction
memtable: Make printable
mvcc: Take partition_entry by const ref in operator<<()
mvcc: Do not evict from non-evictable snapshots
mvcc: Drop unnecessary assignment to partition_snapshot::_version
tests: Use partition_entry::make_evictable() where appropriate
mvcc: Encapsulate construction of evictable entries
(cherry picked from commit 6ccd317c38)
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.
So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.
We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and feeded to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.
Tests: release mode.
Fixes#3148.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
(cherry picked from commit 09f4ee808f)
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accotuning code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.
Fixes#3172.
* https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety-2.1/v1:
memtable: do not update iterator_reader::_last in alloc section
memtable: do not change accounting state in alloc section
tests/memtable: add more reader exception safety tests
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
paroblems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.
The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.
In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.
The approach taken by this series is to, first, make entering allocating
sections cheaper and then reducing the amount of logic that runs inside
of them to a minimum.
This means that instead of entering a section once per a call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.
The optimisations to the allocating sections and managed_bytes
linearised context has successfully eliminated any penalty caused by
much more fine grained allocating sections.
Fixes#3123.
Fixes#3133.
Tests: unit-tests (release)
BEFORE
test iterations median mad min max
memtable.one_partition_one_row 1155362 869.139ns 0.282ns 868.465ns 873.253ns
memtable.one_partition_many_rows 127252 7.871us 15.252ns 7.851us 7.886us
memtable.many_partitions_one_row 58715 17.109us 2.765ns 17.013us 17.112us
memtable.many_partitions_many_rows 4839 206.717us 212.385ns 206.505us 207.448us
AFTER
test iterations median mad min max
memtable.one_partition_one_row 1194453 839.223ns 0.503ns 834.952ns 842.841ns
memtable.one_partition_many_rows 133785 7.477us 4.492ns 7.473us 7.507us
memtable.many_partitions_one_row 60267 16.680us 18.027ns 16.592us 16.700us
memtable.many_partitions_many_rows 4975 201.048us 144.929ns 200.822us 201.699us
./before_sq ./after_sq diff
read 337373.86 353694.24 4.8%
write 388759.99 394135.78 1.4%
* https://github.com/pdziepak/scylla.git memtable-exception-safety-2.1/v1:
flat_mutation_reader: add allocation point in push_mutation_fragment
linearization_context: remove non-trivial operations from fast path
lsa: split alloc section into reserving and reclamation-disabled parts
lsa: optimise disabling reclamation and invalidation counter
mutation_fragment: allow creating clustering row in place
paratition_snapshot_reader: minimise amount of retryable code
memtable: drop memtable_entry::read()
tests/memtable: add test for reader exception safety
Retryable code that has side effects is a recipe for bugs. This patch
reworkds the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has a very limited side effects.
Moving clustering_row is expensive due to amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.
However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.
In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.
Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.
In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.
All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that lineatization_context is trivially
constructible and the map is cleared only if it was modified.
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.
push_mutation_fragment() adds a mutation fragment to a circular_buffer,
in theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
Fixes#3096
The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service which effectively removed
the "virtualization" of underlying auth in the SASL challenge.
As a quick workaround, add the permissive exception handling to
sasl object as well.
Message-Id: <20180103102724.1083-1-calle@scylladb.com>
(cherry picked from commit 35b9ec868a)
* seastar 8d254a1...af1b789 (3):
> tls_test: Fix echo test not setting server trust store
> tls: Do not restrict re-handshake to client
> tls: Actually verify client certificate if requested
Fixes#3072
The reason sstable key estimation is inaccurate is that it doesn't account that
index sampling is now dynamic.
The estimation is done as follow:
uint64_t get_estimated_key_count() const {
return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
_components->summary.header.min_index_interval;
}
The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.
From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.
Tests: units (release)
Fixes#3113.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
(cherry picked from commit 2c181b69c9)
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120"
* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
lsa: Expose max_zone_segments for tests
lsa: Expose tracker::non_lsa_used_space()
lsa: Fix memory leak on zone reclaim
(cherry picked from commit 4ad212dc01)
The call chain is:
storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()
Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.
In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.
Tested with update_cluster_layout_tests.py
Fixes#2886
Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
(cherry picked from commit 5107b6ad16)
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changes the way we replicate _token_metadata and
endpoint_state_map. Before they are replicated at the same time, after
they are not any more. This causes a shard in NORMAL status can still be
with a empty _token_metadata.
We saw errors:
[shard 12] token_metadata - sorted_tokens is empty in first_token_index!
during CorruptThenRepairNemesis.
Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.
Fixes#3121
Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
(cherry picked from commit 3c8ed255ac)
Problem introduced in 375ed938b4
Also remove redefinition of schema in dummy incremental selector
which is supposed to use the one in base class instead.
Following tests are fixed:
./build/release/tests/mutation_reader_test
./build/release/tests/sstable_test -- -c1
./build/release/tests/row_cache_test
./build/release/tests/cache_flat_mutation_reader_test
./build/release/tests/row_cache_stress_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180111153831.17462-1-raphaelsc@scylladb.com>
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.
Also, apt-key adv --fetch-keys does not works correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.
Fixes#3088
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b68ee98310)
We provided "boost1.63" package for Debian 8 since we couldn't build
"scylla-boost163" package witch is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 51013f561d)
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.
That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.
The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".
perf_simple_query -c4 read results (medians of 60):
original regression
before 8731c1 after 8731c1 diff
read 326241.02 300244.09 -8.0%
this patch
before after diff
read 313882.59 325148.05 3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
(cherry picked from commit b4a4c04bab)
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.
In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes#3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
(cherry picked from commit 774307b3a7)