Refs #9053
Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to in fact do enforce it.
Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.
Closes#9195
(cherry picked from commit 3633c077be)
Fixes#8952
In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all",
need to operate on a defensive copy of the list.
Closes#8980
(cherry picked from commit ce45ffdffb)
Fixes#8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes#8764
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
(cherry picked from commit 14252c8b71)
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.
This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.
In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.
Fixes#8577
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8578
* github.com:scylladb/scylla:
commitlog: segment_manager::shutdown: abort on errors
commitlog: allocate_segment_ex: make_checked_file
(cherry picked from commit 48ff641f67)
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.
This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).
The fix here is to select a sligtly small maxsize that is aligned
down to a multiple of 1MB.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
(cherry picked from commit 705f9c4f79)
Fixes#8363Fixes#8376
Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
(cherry picked from commit 8e808a56d2)
Fixes#8369
This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.
We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file. Not intended.
Closes#8370
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps so save few cpu cycles.
The option was introduced in seastar by 8bec57bc.
tests: unit(dev), dtest(commitlog:
test_batch_commitlog,
test_periodic_commitlog,
test_commitlog_replay_on_startup)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.
This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.
This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).
Fixes: #5578
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
Fixes#7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to segement, or none.
Returns rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.
A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.
Reopens#5857.
Attempt to hurry flushing/segment delete/recycle if we are trying
to get a segment for allocation, and reserve is empty when above
disk threshold. This is minimize time waited in allocation semaphore.
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().
Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
Normally a file size should be aligned around block size, since
we never write to it any unaligned size. However, we're not
protected against partial writes.
Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
Refs #6148
Separates disk usage into two cases: Allocated and used.
Since we use both reserve and recycled segments, both
which are not actually filled with anything at the point
of waiting.
Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.
And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, when we only have
a half segment left before hitting used == max.
Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.
Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
v2:
* Add some comments/explanations
v3:
* Made disk footprint subtract happen post delete (non-optimistic)
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
Besdies being more robust than passing const descriptor&
to continuations, this helps simplify making allocate_segment_ex's
continuations nothrow_move_constructible, that is need for using
seastar::with_file_close_on_failure().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a mutation is written to the commitlog, a rp_handle object is
returned which keeps a reference to commitlog segment. A segment is
"dirty" when its reference count is not zero, otherwise it is "clean".
When commitlog object is being destroyed, a warning is being printed
for every dirty segment. On the other hand, clean segments are deleted.
In case of standard mutation writing path, the rp_handle moves
responsibility for releasing the reference to the memtable to which the
mutation is written. When the memtable is flushed to disk, all
references accumulated in the memtable are released. In this context, it
makes sense to warn about dirty segments, because such segments contain
mutations that are not written to sstables, and need to be replayed.
However, hinted handoff uses a different workflow - it recreates a
commitlog object periodically. When a hint is written to commitlog, the
rp_handle reference is not released, so that segments with hints are not
deleted when destroying the commitlog. When commitlog is created again,
we get a list of saved segments with hints that we can try to send at a
later time.
Although this is intended behavior, now that releasing the hints
commitlog is done properly, it causes the mentioned warning to
periodically appear in the logs.
This patch adds a parameter for the commitlog that allows to disable
this warning. It is only used when creating hinted handoff commitlogs.
This makes the code a bit easier to read as there are no discarded
futures and no references to having to keep a subscription alive,
which we don't with current seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200527013120.179763-1-espindola@scylladb.com>
Seastar recently lost support for the experimental Concepts Technical
Specification (TS) and gained support for C++20 concepts. Re-enable
concepts in Scylla by updating our use of concepts to the C++20
standard.
This change:
- peels off uses of the GCC6_CONCEPT macro
- removes inclusions of <seastar/gcc6-concepts.hh>
- replaces function-style concepts (no longer supported) with
equation-style concepts
- semicolons added and removed as needed
- deprecated std::is_pod replaced by recommended replacement
- updates return type constraints to use concepts instead of
type names (either std::same_as or std::convertible_to, with
std::same_as chosen when possible)
No attempt is made to improve the concepts; this is a specification
update only.
Message-Id: <20200531110254.2555854-1-avi@scylladb.com>
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517150834.753463-1-avi@scylladb.com>
Fixes#6265
Return type for read_log_file was previously changed from
subscription to future<>, returning the previously returned
subscriptions result of done(). But it did not preserve the
subscription itself, which in turn will cause us to (in
work::stream), call back into a deleted object.
Message-Id: <20200422090856.5218-1-calle@scylladb.com>
Commitlog replay is given a filename prefix to filter files against, but it
ignores it. As a result we will replay anything in that directory, including
recycled segments, which is wasteful.
Fix by adding a check for the prefix.
Tests: unit (dev), manual test that regular commitlog files are not
filtered.
Message-Id: <20200416174542.133230-1-avi@scylladb.com>
When I/O error (e.g. EMFILE / ENOSPC) happens we hit
an assert in ~append_challenged_posix_file_impl(): Assertion _closing_state == state::closed' failed.
Commit 6160b9017d add close on failure
of the lamda defined in allocate_segment_ex, but it doesn't handle an error
after the file is opened/created while it is wrapped with commitlog_file_extensions.
Refs #5657
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200414115231.298632-1-bhalevy@scylladb.com>
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
v2:
- Fix test to take into account that files might be deleted
behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
queued for delete.
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:
"I have bisected the recent failures [in commitlog_test] on next to this
patch."
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Fixes#5891
Refs #5899
When creating segments with the o_dsync option active, we write max_size
zeros to disk, to ensure actual disk blocks are allocated.
However, if we recycle a segment, we should, when not actually creating
a new file, check the existing size on disk, and only zero any blocks
not already allocated (i.e. if recycled file was smaller than max_size,
due to segement truncation on close).
test: unit
Message-Id: <20200226121601.15347-1-calle@scylladb.com>
Due to a bug the entire segment is written in one huge write of 32Mb.
The idea was to split it to writes of 128K, so fix it.
Fixes#5857
Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
There may be other commitlog writes waiting for zeroing to complete, so
not using proper scheduling class causes priority inversion.
Fixes#5858.
Message-Id: <20200220102939.30769-2-gleb@scylladb.com>