Represents a deferring operation which defers cooperatively with the caller.
The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.
This allows the caller to:
1) execute some post-defer and pre-resume actions atomically
2) have control over when the operation is resumed and in which context,
in particular the caller can cancel the operation at deferring points.
It will be used to implement deferring partition_version::apply_to_incomplete().
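
A minimal sketch of the run() contract described above, not the actual class added by this patch. The names deferring_operation and make_sum are hypothetical, and stop_iteration is a local stand-in for the Seastar type so the example is self-contained:

```cpp
#include <functional>
#include <utility>

// Stand-in for seastar::stop_iteration, defined here to keep the sketch standalone.
enum class stop_iteration { no, yes };

// Hypothetical wrapper: each call to run() executes one slice of work.
class deferring_operation {
    std::function<stop_iteration()> _step;
    bool _done = false;
public:
    explicit deferring_operation(std::function<stop_iteration()> step)
        : _step(std::move(step)) {}

    // Starts or resumes the operation. Returns stop_iteration::no if the
    // operation deferred before finishing, stop_iteration::yes once complete.
    stop_iteration run() {
        if (!_done && _step() == stop_iteration::yes) {
            _done = true;
        }
        return _done ? stop_iteration::yes : stop_iteration::no;
    }
};

// Example operation: adds 0..n-1 into `out`, deferring after `batch` additions.
inline deferring_operation make_sum(int n, int batch, long& out) {
    int i = 0;
    return deferring_operation([n, batch, &out, i]() mutable {
        for (int k = 0; k < batch && i < n; ++k) {
            out += i++;
        }
        return i == n ? stop_iteration::yes : stop_iteration::no;
    });
}
```

The caller drives the operation with a loop such as `while (op.run() == stop_iteration::no) { ... }`. The loop body is where post-defer and pre-resume actions would go, and the caller cancels simply by not calling run() again.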
This overload allows searching for elements by an arbitrary key, as long as it hashes
to the same values as the default key and there is a comparator for
this new key.
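
As an illustration of the idea only (this is not the loading_cache code), a toy hash set with a find() that accepts any key type plus a caller-supplied hasher and comparator might look like the following. toy_string_set and its members are hypothetical names; the sketch relies on the C++17 guarantee that std::hash<std::string_view> agrees with std::hash<std::string> for equal contents:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical toy container storing std::string keys in hash buckets.
class toy_string_set {
    std::vector<std::vector<std::string>> _buckets;
public:
    explicit toy_string_set(std::size_t bucket_count = 8) : _buckets(bucket_count) {}

    void insert(std::string s) {
        const std::size_t h = std::hash<std::string>()(s);
        _buckets[h % _buckets.size()].push_back(std::move(s));
    }

    // Lookup by an arbitrary key type. `hash` must produce the same value
    // as std::hash<std::string> would for the equivalent stored key, and
    // `eq` must compare the new key type against a stored key.
    template <typename Key, typename Hash, typename Equal>
    const std::string* find(const Key& key, Hash hash, Equal eq) const {
        for (const std::string& s : _buckets[hash(key) % _buckets.size()]) {
            if (eq(key, s)) {
                return &s;
            }
        }
        return nullptr;
    }
};
```

Here a std::string_view can be used for lookup without constructing a temporary std::string, which is the kind of saving such an overload is usually meant to provide.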
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
remove(key): removes the entry with the given key if it exists, otherwise does nothing.
remove(iterator): removes the entry referenced by the given iterator (returned from loading_cache::find()).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".
I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:
c3f9e6ce1f/tests/perf_row_cache_update.cc
Other workloads and other allocation sites could probably also see the
improvement.
"
* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
lsa: Expose counters for allocation and compaction throughput
lsa: Reduce amount of segment compactions
lsa: Avoid the call to segment_pool::descriptor() in compact()
lsa: Make reclamation on reserve refill more efficient
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).
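
The arithmetic behind the 85% example can be sketched as follows, under the simplifying assumption that all segments are equally occupied. reclaim_cost and cost_to_free_one_segment are hypothetical names, not LSA functions:

```cpp
#include <cstdint>

struct reclaim_cost {
    std::uint64_t segments_compacted;   // segments that must be compacted
    std::uint64_t migrated_hundredths;  // data moved, in 1/100ths of a segment
};

// Cost of freeing one whole segment when every segment is
// `occupancy_percent` full (valid for 0 <= occupancy_percent < 100).
inline reclaim_cost cost_to_free_one_segment(std::uint64_t occupancy_percent) {
    const std::uint64_t free_percent = 100 - occupancy_percent;
    // Each compacted segment contributes free_percent/100 of a segment, so we
    // need ceil(100 / free_percent) of them to add up to one free segment.
    const std::uint64_t segments = (100 + free_percent - 1) / free_percent;
    return {segments, segments * occupancy_percent};
}
```

For 85% occupancy this gives 7 segments compacted and 595 hundredths (5.95, roughly 6) of a segment's worth of data migrated.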
This patch reduces amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.
In perf_row_cache_update test case for large partition with lots of
rows, which simulates appending workload, I measured that for each new
object allocated, 2 need to be migrated, before the patch. After the
patch, only 0.003 objects are migrated. This reduces run time of
cache update part by 50%.
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can export its internal memory usage, we can use that
directly.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:
1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.
2) multiply the number of chunks by 128kB. That is also wrong, because
the chunked_vector will not always allocate the entire chunk if there are
only a few elements in it.
The best way to deal with it is to allow the chunked_vector to export
its current memory usage.
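
To illustrate why both estimates are off, here is a toy chunked vector that tracks per-chunk capacity and reports its own usage. It is a simplified sketch, not Scylla's chunked_vector: the class name, the memory_usage() method, the tiny chunk size and the doubling growth policy are all assumptions made for the example, and the bookkeeping vector itself is not counted:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical toy; T must be default-constructible and copy-assignable.
template <typename T, std::size_t MaxChunkElems = 8>
class toy_chunked_vector {
    struct chunk {
        std::unique_ptr<T[]> data;
        std::size_t capacity;
    };
    std::vector<chunk> _chunks;
    std::size_t _size = 0;
    std::size_t _capacity = 0;

    // Either doubles the partial last chunk (capped at MaxChunkElems)
    // or starts a new one-element chunk when the last chunk is full.
    void grow() {
        if (_chunks.empty() || _chunks.back().capacity == MaxChunkElems) {
            _chunks.push_back({std::make_unique<T[]>(1), 1});
            _capacity += 1;
            return;
        }
        chunk& last = _chunks.back();
        const std::size_t new_cap = std::min(last.capacity * 2, MaxChunkElems);
        auto bigger = std::make_unique<T[]>(new_cap);
        std::copy_n(last.data.get(), last.capacity, bigger.get());
        _capacity += new_cap - last.capacity;
        last.data = std::move(bigger);
        last.capacity = new_cap;
    }
public:
    void push_back(const T& v) {
        if (_size == _capacity) {
            grow();
        }
        _chunks[_size / MaxChunkElems].data[_size % MaxChunkElems] = v;
        ++_size;
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }

    // Bytes actually allocated for element storage.
    std::size_t memory_usage() const {
        std::size_t bytes = 0;
        for (const chunk& c : _chunks) {
            bytes += c.capacity * sizeof(T);
        }
        return bytes;
    }
};
```

With 11 ints and 8-element chunks, the toy holds one full chunk plus a partial chunk of capacity 4, so the real usage (12 elements) lies between the size-based estimate (11 elements) and the chunk-count estimate (16 elements).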
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If a single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it evicts total amount of memory first,
which may reduce the total amount of segment compactions during
refill.
This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the information to the migrator we can avoid some needless
operations.
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound on how many
times they can be created and destroyed, it is better to be safe and
reuse registered migrator ids.
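
One way to reuse ids is a free list, sketched below together with a debug-mode assertion against deregistering an id that is not live. This is an illustrative sketch only; migrator_id_registry and its methods are hypothetical names, not the LSA interface:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical registry handing out small integer ids.
class migrator_id_registry {
    std::vector<bool> _live;               // _live[id] is true while id is registered
    std::vector<std::uint32_t> _free_ids;  // released ids, reused before new ones
public:
    std::uint32_t register_id() {
        if (!_free_ids.empty()) {
            const std::uint32_t id = _free_ids.back();
            _free_ids.pop_back();
            _live[id] = true;
            return id;
        }
        _live.push_back(true);
        return static_cast<std::uint32_t>(_live.size() - 1);
    }

    void deregister_id(std::uint32_t id) {
        // Debug-only check: catches double release and unknown ids early.
        assert(is_registered(id) && "deregistering an id that is not live");
        _live[id] = false;
        _free_ids.push_back(id);
    }

    bool is_registered(std::uint32_t id) const {
        return id < _live.size() && _live[id];
    }

    // Number of distinct ids ever created; stays bounded when ids are reused.
    std::size_t ids_ever_allocated() const { return _live.size(); }
};
```

Because released ids are handed out again, the id space grows only with the peak number of simultaneously registered migrators, not with the total number of create/destroy cycles.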
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).
May cause performance problems.
Fixes #3402."
* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
mvcc: Test version merging when snapshots go away
anchorless_list: Make ranges conform to SinglePassRange
anchorless_list: Drop deprecated use of std::iterator
mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.
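
A generic double-hashing sketch showing what swapping the two values does to the resulting bit indices. This is not the Scylla or Cassandra implementation: filter_bit_indices is a hypothetical helper, the unsigned modulo is a simplification, and which of the two hash values plays which role in each format is left as a flag rather than asserted:

```cpp
#include <cstdint>
#include <vector>

// Produces `hash_count` bit indices from two hash values using
// index_i = (base + i * increment) mod bit_count.
// With swapped == false, h0 is the base and h1 the increment;
// with swapped == true the roles are exchanged.
inline std::vector<std::uint64_t> filter_bit_indices(std::uint64_t h0,
                                                     std::uint64_t h1,
                                                     unsigned hash_count,
                                                     std::uint64_t bit_count,
                                                     bool swapped) {
    std::uint64_t base = swapped ? h1 : h0;
    const std::uint64_t increment = swapped ? h0 : h1;
    std::vector<std::uint64_t> indices;
    indices.reserve(hash_count);
    for (unsigned i = 0; i < hash_count; ++i) {
        indices.push_back(base % bit_count);
        base += increment;  // unsigned wraparound is well defined
    }
    return indices;
}
```

The same pair of hash values yields different bit positions depending on the order, which is why a filter written in one format cannot be read with the other format's ordering.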
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
* seastar 70aecca...ac02df7 (5):
> Merge "Prefix preprocessor definitions" from Jesse
> cmake: Do not enable warnings transitively
> posix: prevent unused variable warning
> build: Adjust DPDK options to fix compilation
> io_scheduler: adjust property names
DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references are now prefixed with SEASTAR_. Some may need to become
Scylla macros.
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.
Part of #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.
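
The behaviour can be sketched without Seastar as a wrapper around a callable that yields buffers, where an empty buffer signals end of stream. limiting_source and the source alias are hypothetical names for this example, not the data_source API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// A source returns the next buffer, or an empty string at end of stream.
using source = std::function<std::string()>;

// Wraps another source and hands its data out in pieces of at most `limit` bytes.
class limiting_source {
    source _src;
    std::size_t _limit;
    std::string _buf;      // last buffer fetched from the underlying source
    std::size_t _pos = 0;  // how much of _buf has already been returned
public:
    limiting_source(source src, std::size_t limit)
        : _src(std::move(src)), _limit(limit) {
        assert(_limit > 0);
    }

    std::string get() {
        if (_pos == _buf.size()) {
            _buf = _src();
            _pos = 0;
            if (_buf.empty()) {
                return {};  // underlying source is exhausted
            }
        }
        const std::size_t n = std::min(_limit, _buf.size() - _pos);
        std::string out = _buf.substr(_pos, n);
        _pos += n;
        return out;
    }
};
```

Note that this sketch never merges data across input buffers, so a returned piece can be shorter than the limit at the end of each input buffer.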
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.
Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.
In that commit, Avi says:
"... providing iterator-based load() and save() methods. The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."
The only user of this interface is SSTables. And it turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.
The problem here is that this requires the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough, that in itself
can take a long time without yielding (up to 16ms seen in my setup).
We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).
By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.
Fixes #3341
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.
Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.
However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.
Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section). Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented. This results in the original allocation continuing to fail.
To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim. The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
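
The bail-out logic can be sketched as a loop that walks segments in address order and gives up after a fixed number of failures. toy_segment, reclaim_result, reclaim_segments and the max_failed parameter are hypothetical names for illustration, not the logalloc code:

```cpp
#include <cstddef>
#include <vector>

struct toy_segment {
    bool reclaim_disabled;  // true if the owning region currently forbids reclaim
};

struct reclaim_result {
    std::size_t reclaimed;  // segments successfully reclaimed
    bool gave_up;           // true if the failure limit was reached
};

// Tries to reclaim `wanted` segments, scanning from the lowest address.
// Stops early once `max_failed` segments turned out to be unreclaimable,
// leaving it to the caller to retry when all regions are reclaimable.
inline reclaim_result reclaim_segments(const std::vector<toy_segment>& by_address,
                                       std::size_t wanted,
                                       std::size_t max_failed) {
    std::size_t reclaimed = 0;
    std::size_t failed = 0;
    for (const toy_segment& seg : by_address) {
        if (reclaimed == wanted) {
            break;
        }
        if (seg.reclaim_disabled) {
            if (++failed >= max_failed) {
                return {reclaimed, true};
            }
            continue;
        }
        ++reclaimed;
    }
    return {reclaimed, false};
}
```

Giving up early avoids reclaiming scattered segments beyond a run of locked ones, which is the fragmentation the limit is meant to prevent.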
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.
After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.
Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.
Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
They are no longer used, and cannot be efficiently implemented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.
Also remove find_previous_set() for the same reason.
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).
128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.
Moving just one segment at a time reduces the latency costs of
transferring memory between free and std.
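
A toy model of the address preference, assuming segments are identified by their index in address order and that lsa initially owns all of them. toy_segment_pool and its methods are hypothetical; the real allocator also has to evict or compact occupied low segments, which this sketch leaves out:

```cpp
#include <cstddef>
#include <iterator>
#include <optional>
#include <set>

// Hypothetical model: tracks only which free segments lsa still owns.
class toy_segment_pool {
    std::set<std::size_t> _lsa_free;  // free segments owned by lsa, by address
public:
    explicit toy_segment_pool(std::size_t segment_count) {
        for (std::size_t i = 0; i < segment_count; ++i) {
            _lsa_free.insert(i);
        }
    }

    // lsa data goes to the highest-addressed free segment.
    std::optional<std::size_t> allocate_for_lsa() {
        if (_lsa_free.empty()) {
            return std::nullopt;
        }
        auto it = std::prev(_lsa_free.end());
        const std::size_t seg = *it;
        _lsa_free.erase(it);
        return seg;
    }

    // Exactly one segment, the lowest-addressed free one, is handed
    // to the standard allocator per call.
    std::optional<std::size_t> release_one_to_std() {
        if (_lsa_free.empty()) {
            return std::nullopt;
        }
        auto it = _lsa_free.begin();
        const std::size_t seg = *it;
        _lsa_free.erase(it);
        return seg;
    }
};
```

In this model lsa data accumulates at the top of the address range while the standard allocator receives the bottom, so the two populations drift apart rather than interleave.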
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
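
The technique, sketched on a stand-in class template with a virtual member (named_value_sketch is hypothetical and uses int instead of log_level). In a real project the two declarations at the bottom would be split between a header and a single source file; both appear in one translation unit here only to keep the example self-contained:

```cpp
#include <utility>

// Stand-in for a class template with virtual functions, and hence a vtable.
template <typename T>
class named_value_sketch {
    T _value;
public:
    explicit named_value_sketch(T v) : _value(std::move(v)) {}
    virtual ~named_value_sketch() = default;
    virtual const T& get() const { return _value; }
};

// Would live in the header: an explicit instantiation declaration. It tells
// every includer not to instantiate this specialization implicitly.
extern template class named_value_sketch<int>;

// Would live in exactly one .cc file: the explicit instantiation definition,
// which emits the member functions and the vtable there.
template class named_value_sketch<int>;
```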
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
This reverts commit 3b53f922a3. It's broken
in two ways:
1. concrete_allocating_function::allocate()'s caller,
region_group::start_releaser() loop, will delete the object
as soon as it returns; however we scheduled some work depending
on `this` in a separate continuation (via with_scheduling_group())
2. the calling loop's termination condition depends on the work being
done immediately, not later.
"
Additional extension points.
* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
to inject extensions into hardcoded schemas.
Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).
Configurable listeners are now allowed to inject configuration
objects into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.
All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"
* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
db::extensions: Allow extensions to modify (system) schemas
db::commitlog: Add commitlog/hints file io extension
db::commitlog: Do segment delete async + force replay delete go via CL
main/init: Change configurable callbacks and calls to allow adding opts
util::config_file: Add "add" config item overload
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer
we will execute correctly. But if we do defer, the current code will
execute - in the future - with the default scheduling group.
This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.
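
The capture-and-restore pattern can be sketched without Seastar by modelling the current scheduling group as a global string. current_group and deferred_queue are hypothetical stand-ins for this example and do not correspond to Seastar APIs:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for "the scheduling group we are running in".
std::string current_group = "default";

class deferred_queue {
    struct item {
        std::string group;         // group that was current at enqueue time
        std::function<void()> fn;  // work to run later
    };
    std::deque<item> _items;
public:
    // Records the caller's group together with the function.
    void run_later(std::function<void()> fn) {
        _items.push_back({current_group, std::move(fn)});
    }

    // Runs each deferred function under the group captured for it,
    // then restores whatever group was current before.
    void drain() {
        while (!_items.empty()) {
            item it = std::move(_items.front());
            _items.pop_front();
            std::string saved = std::exchange(current_group, it.group);
            it.fn();
            current_group = std::move(saved);
        }
    }
};
```

Without the captured group, the deferred function would observe whatever group happens to be current when the queue is drained, which is the behaviour this patch fixes.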
Signed-off-by: Glauber Costa <glauber@scylladb.com>