Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
type: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force syncronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
v2:
* Improve some bookeep, ensure we keep track of segments and flush
properly, to get counter correct
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
If set, any remaining segment that has data older than this threshold
will request flushing, regardless of data pressure. I.e. even a system
where nothing happends will after X seconds flush data to free up the
commit log.
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to use fmt::streamed() when
the corresponding format_as() as available.
simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19594
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19313
Fixes#18488
Due to the discrepancy between bytes added to CL and bytes written to disk
(due to CRC sector overhead), we fail to account for the proper byte count
when issuing account_memory_usage in allocate (using bytes added) and in
cycle:s notify_memory_written (disk bytes written).
This leads us to slowly, but surely, add to the semaphore all the time.
Eventually rendering it useless.
Also, terminate call would _not_ take any of this into account,
and the chunk overhead there would cause a (smaller) discrepancy
as well.
Fix by simply ensuring that buffer alloc handles its byte usage,
then accounting based on buffer position, not input byte size.
Closesscylladb/scylladb#18489
Fixes#18329
named_file::assign call uses old object "known_size" after a move
of the object. While this is wholly ok, since the attribute accessed
will not be modified/destroyed by the move, it causes warnings in
"tidy" runs, and might confuse or cause real errors should impl. change.
Closesscylladb/scylladb#18337
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change:
* add `format_as()` for `segment` so we can use it as a fallback
after upgrading to {fmt} v10
* use fmt::streamed() when formatting `segment`, this will be used
the intermediate solution before {fmt} v10 after dropping
`FMT_DEPRECATED_OSTREAM` macro
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18019
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* db::commitlog::segment::cf_mark
* db::commitlog::segment_manager::named_file
* db::commitlog::segment_manager::dispose_mode
* db::commitlog::segment_manager::byte_flow<T>
please note, the formatter of `db::commitlog::segment` is not
included in this commit, as we are formatting it in the inline
definition of this class. so we cannot define the specialization
of `fmt::formatter` for this class before its callers -- we'd
either use `format_as()` provided by {fmt} v10, or use `fmt::streamed`.
either way, it's different from the theme of this commit, and we
will handle it in a separated commit.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17792
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
This makes scylla-manager mis-interpret the data_file_directories
somehow, issue #17078
This change removes usage of db::config to
get path of commitlog_directory. Instead, it
introduces a new parameter to directly pass
the path to db::commitlog::config::from_db_config().
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::replay_position`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17014
Add a helper function which returns the minimum replay position
across all existing or future commitlog segments.
Only positions greater or equal to it can be replayed on the next reboot.
We will use this helper in a future patch to garbage collect some cleanup
metadata which refers to replay positions.
Refs #16757
Allows waiting for all previous and pending segment deletes to finish.
Useful if a caller of `discard_completed_segments` (i.e. a memtable
flush target) not only wants to ensure segments are clean and released,
but thoroughly deleted/recycled, and hence no treat to resurrecting
data on crash+restart.
Test included.
Closesscylladb/scylladb#16801
seastar::logger is using the compile-time format checking by default if
compiled using {fmt} 8.0 and up. and it requires the format string to be
consteval string, otherwise we have to use `fmt::runtime()` explicitly.
so adapt the change, let's use the consteval string when formatting
logging messages.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16612
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in https://github.com/scylladb/scylladb/pull/15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Closesscylladb/scylladb#16302
* github.com:scylladb/scylladb:
commitlog: Fix allocation size check to take sector overhead into account.
commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart
Fixes#16301
The calculation on whether data may be added is based on position vs. size of incoming data.
However, it did not take sector overhead into account, which lead us to writing past allowed
segment end, which in turn also leads to metrics overflows.
Fixes#16298
The adjusted buffer position calculation in buffer_position(), introduced in #15494
was in fact broken. It calculated (like previously) a "position" based on diff between
underlying buffer size and ostream size() (i.e. avail), then adjusted this according to
sector overhead rules.
However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted.
The two cannot be compared as such, which means the "positions" we get here are borked.
Luckily for us (sarcasm), the position calculation in replayer made a similar error,
in that it adjusts up current position by one sector overhead to much, leading to us
more or less getting the same, erroneous results in both ends.
However, when/iff one needs to adjust the segment file format further, one might very
quickly realize that this does not work well if, say, one needs to be able to safely
read some extra bytes before first chunk in a segment. Conversely, trying to adjust
this also exposes a latent potential error in the skip mechanism, manifesting here.
Issue fixed by keeping track of the initial ostream capacity for segment buffer, and
use this for position calculation, and in the case of replayer, move file pos adjustment
from read_data() to subroutine (shared with skipping), that better takes data stream
position vs. file position adjustment. In implementaion terms, we first inc the
"data stream" pos (i.e. pos in data without overhead), then adjust for overhead.
Also fix replayer::skip, so that we handle the buffer/pos relation correctly now.
Added test for intial entry position, as well as data replay consistency for single
entry_writer paths.
Prototype implementation of format suggested/requested by @avikivity:
Divides segments into disk-write-alignment sized pages, each tagged with segment ID + CRC of data content.
When read, we both verify sector integrity (CRC) to detect corruption, as well as matching ID read with expected one.
If the latter mismatches we have a prematurely terminated segment (read truncation), which, depending on whether the CL is
written in batch or periodic mode, as well as explicit sync, can mean data loss.
Note: all-zero pages are treated as kosher, both to align with newly allocated segments, as well as fully terminated (zero-page) ones.
Note: This is a preview/RFC - the rest of the file format is not modified. At least parts of entry CRC could probably be removed, but I have not done so yet (needs some thinking).
Note: Some slight abstraction breaks in impl. and probably less than maximal efficiency.
v2:
* Removed entry CRC:s in file format.
* Added docs on format v3
* Added one more test for recycling-truncation
v3:
* Fixed typos in size calc and docs
* Changed sect metadata order
* Explicit iter type
Closesscylladb/scylladb#15494
* github.com:scylladb/scylladb:
commitlog_test: Add test for replaying large-ish mutation
commitlog_test: Add additional test for segmnent truncation
docs: Add docs on commitlog format 3
commitlog: Remove entry CRC from file format
commitlog: Implement new format using CRC:ed sectors
commitlog: Add iterator adaptor for doing buffer splitting into sub-page ranges
fragmented_temporary_buffer: Add const iterator access to underlying buffers
commitlog_replayer: differentiate between truncated file and corrupt entries
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
This miniset, completes the prerequisites for enabling commitlog hard limit on by default.
Namely, start flushing and evacuating segments halfway to the limit in order to never hit it under normal circumstances.
It is worth mentioning that hitting the limit is an exceptional condition which it's root cause need to be resolved, however,
once we do hit the limit, the performance impact that is inflicted as a result of this enforcement is irrelevant.
Tests: unit tests.
LWT write test (#9331)
A whitebox testing has been performed by @wmitros , the test aimed at putting as much pressure as possible on the commitlog segments by using a write pattern that rewrites the partitions in the memtable keeping it at ~85% occupancy so the dirty memory manager will not kick in. The test compared 3 configurations:
1. The default configuration
2. Hard limit on (without changing the flush threshold)
3. the changes in this PR applied.
The last exhibited the "best" behavior in terms of metrics, the graphs were the flattest and less jaggy from the others.
Closesscylladb/scylladb#10974
* github.com:scylladb/scylladb:
commitlog: enforce commitlog size hard limit by default
commitlog: set flush threshold to half of the limit size
commitlog: unfold flush threshold assignment
Fixes#16207
commitlog::delete_segments deletes (or recycles) segments replayed.
The actual file size here is added to footprint so actual delete then
can determine iff things should be recycled or removed.
However, we build a pending delete list of named_files, and the files
we added did not have size set. Bad. Actual deletion then treated files
as zero-byte sized, i.e. footprint calculations borked.
Simple fix is just filling in the size of the objects when addind.
Added unit test for the problem.
Closesscylladb/scylladb#16210
Once we enable commitlog hard limit by default, we would like
to have some room in case flushing memtables takes some time
to catch up. This threshold is half the limit.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
This commit is only a cosmetic change. It is meant to
make the flush threshold assignment more readable and
comprehensible so future changes are easier to review.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Since CRC is already handled by disk blocks, we can remove some of the
entry CRC:ing, both simplifying code and making at least that part of
both write and read faster.
Breaks the file into individually tagged + crc:ed pages.
Each page (sized as disk write alignment) gets a trailing
12-byte metadata, including CRC of the first page-12 bytes,
and the ID of the segment being written.
When reading, each page read is CRC:ed and checked to be part
of the expected segment by comparing ID:s. If crc is broken,
we have broken data. If crc is ok, but ID does not match, we
have a prematurely terminated segment (truncated), which, depending
on whether we use batch mode or not, implied data loss.
Refs #11845
When replaying, differentiate between the two cases for failure we have:
- A broken actual entry - i.e. entry header/data does not hold up to
crc scrutiny
- Truncated file - i.e. a chunk header is broken or unreadable. This can
be due to either "corruption" (i.e. borked write, post-corruption, hw
whatever), or simply an unterminated segment.
The difference is that the former is recoverable, the latter is not.
We now signal and report the two separately. The end result for a user
is not much different, in either case they imply data loss and the
need for repair. But there is some value in differentiating which
of the two we encountered.
Modifies and adds test cases.
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment
first. Because GC clock is wall clock time and only connected to TTL (not
cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment.
Includes of course a function to get the all-segment lowest per CF.
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
this change tries to reduce the number of callers using operator<<()
for printing UUID. they are found by compiling the tree after commenting
out `operator<<(std::ostream& out, const UUID& uuid)`. but this change
alone is not enough to drop all callers, as some callers are using
`operator<<(ostream&, const unordered_map&)` and other overloads to
print ranges whose elements contain UUID. so in order to limit the
scope of the change, we are not changing them here.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.
Closes#12950
* github.com:scylladb/scylladb:
commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
commitlog: change type of stored size
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken
comparisons in delete_segments.
Refs #11710
Allows reusing regex for segment matching (for opening left-over segments after crash).
Should remove any stalls caused by commitlog replay preparation.
v2: Add unit test for descriptor parsing
Closes#12112
request_controller_timeout_exception_factory::timeout() creates an
instance of `request_controller_timed_out_error` whose ctor is
default-created by compiler from that of timed_out_error, which is
in turn default-created from the one of `std::exception`. and
`std::exception::exception` does not throw. so it's safe to
mark this factory method `noexcept`.
with this specifier, we don't need to worry about the exception thrown
by it, and don't need to handle them if any in `seastar::semaphore`,
where `timeout()` is called for the customized exception.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12759
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes#12645Closes#12646
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes#11508Closes#11536