Refs #9053
Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to in fact do enforce it.
Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.
Closes#9195
(cherry picked from commit 3633c077be)
Fixes#8363Fixes#8376
Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
(cherry picked from commit 8e808a56d2)
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.
This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
Fixes#7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to segement, or none.
Returns rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
"
`close_on_failure` was committed to seastar so use
the library version.
This requires making the lambda function passed to
it nothrow move constructible, so this series also
makes db::commitlog::descriptor move constructor noexcept
and changes allocate_segment_ex and segment::segment
to get a descriptor by value rather than by reference.
Test: unit(dev), commitlog_test(debug)
"
* tag 'commit-log-use-with_file_close_on_failure-v1' of github.com:bhalevy/scylla:
commitlog: use seastar::with_file_close_on_failure
commitlog: descriptor: make nothrow move constructible
commitlog: allocate_segment_ex, segment: pass descriptor by value
commitlog: allocate_segment_ex: filename capture is unused
All they can live with forward declaration of the f._m._r. plus a
seastar header in commitlog code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a mutation is written to the commitlog, a rp_handle object is
returned which keeps a reference to commitlog segment. A segment is
"dirty" when its reference count is not zero, otherwise it is "clean".
When commitlog object is being destroyed, a warning is being printed
for every dirty segment. On the other hand, clean segments are deleted.
In case of standard mutation writing path, the rp_handle moves
responsibility for releasing the reference to the memtable to which the
mutation is written. When the memtable is flushed to disk, all
references accumulated in the memtable are released. In this context, it
makes sense to warn about dirty segments, because such segments contain
mutations that are not written to sstables, and need to be replayed.
However, hinted handoff uses a different workflow - it recreates a
commitlog object periodically. When a hint is written to commitlog, the
rp_handle reference is not released, so that segments with hints are not
deleted when destroying the commitlog. When commitlog is created again,
we get a list of saved segments with hints that we can try to send at a
later time.
Although this is intended behavior, now that releasing the hints
commitlog is done properly, it causes the mentioned warning to
periodically appear in the logs.
This patch adds a parameter for the commitlog that allows to disable
this warning. It is only used when creating hinted handoff commitlogs.
Since seastar::streams are based on future/promise, variadic streams
suffer the same fate as variadic futures - deprecation and eventual
removal.
This patch therefore replaces a variadic stream in commitlog::read_log_file()
with a non-variadic stream, via a helper struct.
Tests: unit (dev)
Refs #3929
Optionally enables O_DSYNC mode for segment files, and when
enabled ignores actual flushing and just barriers any ongoing
writes.
Iff using O_DSYNC mode, we will not only truncate the file
to max size, but also do an actual initial write of zero:s
to it, since XFS (intended target) has observably less good
behaviour on non-physical file blocks. Once written (and maybe
recycled) we should have rather satisfying throughput on writes.
Note that the O_DSYNC behaviour is hidden behind a default
disabled option. While user should probably seldom worry about
this, we should add some sort of logic i main/init that unless
specified by user, evaluates the commitlog disk and sets this
to true if it is using XFS and looks ok. This is because using
O_DSYNC on things like EXT4 etc has quite horrible performance.
All above statements about performance and O_DSYNC behaviour
are based on a sampling of benchmark results (modified fsqual)
on a statistically non-ssignificant selection of disks. However,
at least there the observed behaviour is a rather large
difference between ::fallocate:ed disk area vs. actually written
using O_DSYNC on XFS, and O_DSYNC on EXT4.
Note also that measurements on O_DSYNC vs. no O_DSYNC does not
take into account the wall-clock time of doing manual disk flush.
This is intentionally ignored, since in the commitlog case, at
least using periodic mode, flushes are relatively rare.
Message-Id: <20190520120331.10229-1-calle@scylladb.com>
Introduce a common base class for all errors that indicate that the current
segment has "issues".
This allows a laconic "catch" clause for all such errors.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Refs #4085
Changes commitlog descriptor to both accept "Recycled-Commitlog..."
file names, and preserve said name in the descriptor.
This ensures we pick up the not-yet-used recycled segments left
from a crash for replay. The replay in turn will simply ignore
the recycled files, and post actual replay they will be deleted
as needed.
Message-Id: <20190129123311.16050-1-calle@scylladb.com>
Refs #3929
When deleting a segment, IFF we have not yet filled up all reserves,
instead of actually deleting the file, put it on a "recycle" list.
Next segment allocation will instead of creating a new one simply
rename the segment and reuse the file and its allocated space.
We rename the file twice: Once on adding to recycle list, with special
prefix so we don't mix up actual replayable segments and these. Second
when we actually re-use the file (also to ensure consecutive names).
Note that we limit the amount of recyclables, so a really stressed
application which somehow fills up the replenish queue might
cause us to still drop the segments. Could skip this but risk
getting to many files on disk.
Replay should be safe, since all entries are guarded by CRC based
on the file ID (i.e. file name). Thus replaying a recycled segment
will simply cause a CRC error in the main header and be ignored (see
previous patch).
Segments that are fully synced will have terminating zero-header (see
previous patch) so we know when to stop processing a recycled file.
If a file is the result of a mid-write crash, we will generate a CRC
processing error as "normally" in this case, when hitting partially
written block or coming to an old/new chunk boundary.
v2:
* Sync dir on rename
* auto -> const sstring&
* Allow recycling files as long as we're within disk space limits
v3:
* Use special names for files waiting for reuse
When reading the header chunk of a commitlog file, check the stored id
value against the id derived from the file name, and ignore if
mismatched. This is a prerequisite for re-using renamed commitlog files,
as we can then fail-fast should one such be left on disk, instead of
trying to replay it.
We also check said id via the CRC check for each chunk parsed. If we
find a chunk with
mismatched id, we will get a CRC error for the chunk, and replay will
terminate (albeit not gracefully).
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
muation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, which is the standard
buffer size in Scylla and keep large values fragmented.
Refs #2858
Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.
So this patch adds a default value for this parameter, as was suggested
by the original commit message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This parameter is used when creating a new segment.
It's default value is a descriptor::FILENAME_PREFIX.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use per CF-id reference count instead, and use handles as result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.
Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.
This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.
The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.
Introduced in abe7358767.
dtest failure:
commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.
That is obviously something we must do, but the current approach to it
is problematic for two main reasons:
1) It forces the requests that trigger a write to wait on the current
write to finish. That is excessive; ideally we would wait for one
particular write to finish, not necessarily the current one. That
is made worse by the fact that when a write is followed by a flush
(happens when we move to a new segment), then we must wait for
*all* writes in that segment to finish.
1) it casts concurrency in terms of writes instead of memory, which
makes the aforementioned problem a lot worse: if we have very big
buffers in flight and we must wait for them to finish, that can
take a long time, often in the order of seconds, causing timeouts.
The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.
This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.
To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.
After 10 minutes running this load, we are left with the following
percentiles:
latency mean : 51.9 [WRITE:51.9]
latency median : 9.8 [WRITE:9.8]
latency 95th percentile : 125.6 [WRITE:125.6]
latency 99th percentile : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max : 2338.2 [WRITE:2338.2]
After this patch:
latency mean : 54.9 [WRITE:54.9]
latency median : 43.5 [WRITE:43.5]
latency 95th percentile : 126.9 [WRITE:126.9]
latency 99th percentile : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max : 471.4 [WRITE:471.4]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.
This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"
Extracted here so it sits clearly in a different patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The segment->segment_manager pointer has, until now, been a raw pointer,
which in a way is sensible, since making circular shared pointer
relations is in general bad. However, since the code and life cycle
of segments has evolved quite a bit since that initial relation
was defined, becoming both more and then suddenly, in a sense,
less, asynchronous over time, the usage of the relation is in fact
more consistent with a shared pointer, in that a segment needs to
access its manager to properly do things like write and flush.
These two ops in particular depend on accessing the segment manager
in a way that might be fine even using raw pointers, if it was not
again for that little annoying thing of continuation reordering.
So, lets just make the relation a shared pointer, solving the issue
of whether the manager is alive when a segment accesses it. If it
has been "released" (shut down), the existing mechanisms (gate)
will then trigger and prevent any actual _actions_ from taking
place. And we don't have to complicate anything else even more.
Only "big" change is that we need to explicitly orphan all
segments in commitlog destructor (segment_manager is essentially
a p-impl).
This fixes some spurious crashes in nightly unit tests.
Fixes#966.
Message-Id: <1456838735-17108-1-git-send-email-calle@scylladb.com>
Configured on start (for now - and dummy values at that).
When shard write/flush count reaches limit, and incoming ops will queue
until previous ones finish.
Consequently, if an allocation op forces a write, which blocks, any
other incoming allocations will also queue up to provide back pressure.
Current commitlog interface requires writers to specify the size of a
new entry which cannot depend on the segment to which the entry is
written.
If column mappings are going to be stored in the commitlog that's not
enough since we don't know whether column mapping needs to be written
until we known in which segment the entry is going to be stored.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>