Fixes#8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes#8764
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
(cherry picked from commit 14252c8b71)
Fixes#8363Fixes#8376
Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
(cherry picked from commit 8e808a56d2)
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes#8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
When replaying the commitlog, pass keys to
`validation::validate_cql_key()`. Discard entries which fail validation
and warn about it in the logs.
This prevents invalid keys from getting into the system, possibly
failing the commitlog replay and the successful boot of the node,
preventing the node from recovering data.
Fixes#6195
test_commitlog_delete_when_over_disk_limit reads current segment list
in flush handler, to compare with result after allowing deletetion of
segement. However, it might be called more than once in rare cases,
because timing and us using rather small sizes.
Reading the list the second time however is not a good idea, because
it might just very well be exactly the same as what we read in the
test check code, and we actually overwrite the list we want to
check against. Because callback is on timer. And test is not.
Message-Id: <20200414114322.13268-1-calle@scylladb.com>
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
v2:
- Fix test to take into account that files might be deleted
behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
queued for delete.
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:
"I have bisected the recent failures [in commitlog_test] on next to this
patch."
Fixes#5899
When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).
However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).
We should only write trailing zero if we are below the max_size
file size when terminating
Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
1. Move tests to test (using singular seems to be a convention
in the rest of the code base)
2. Move boost tests to test/boost, other
(non-boost) unit tests to test/unit, tests which are
expected to be run manually to test/manual.
Update configure.py and test.py with new paths to tests.