As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
this change tries to reduce the number of callers using operator<<()
for printing UUID. they are found by compiling the tree after commenting
out `operator<<(std::ostream& out, const UUID& uuid)`. but this change
alone is not enough to drop all callers, as some callers are using
`operator<<(ostream&, const unordered_map&)` and other overloads to
print ranges whose elements contain UUID. so in order to limit the
scope of the change, we are not changing them here.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().
as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.
please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.
some noteworthy things related to this change:
* unlike the existing `join()`, `fmt::join()` returns a view. so we
have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
for instance, a `const std::string*`. but operator<<() always print
a typed pointer. so if we want to format a typed pointer, we either
need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
`operator<<(std::ostream& os, const column_definition* cd)`, so we
have to use a wrapper class of `maybe_column_definition` for printing
a pointer to `column_definition`. since the overload is only used
by the two overloads of
`statement_restrictions::add_single_column_parition_key_restriction()`,
the operator<< for `const column_definition*` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The replayer code needs system keyspace to fetch truncation records
from, thus it needs this explicit dependency. By the time it runs system
keyspace is fully initialized already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.
Closes#12950
* github.com:scylladb/scylladb:
commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
commitlog: change type of stored size
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken
comparisons in delete_segments.
Refs #11710
Allows reusing regex for segment matching (for opening left-over segments after crash).
Should remove any stalls caused by commitlog replay preparation.
v2: Add unit test for descriptor parsing
Closes#12112
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
request_controller_timeout_exception_factory::timeout() creates an
instance of `request_controller_timed_out_error` whose ctor is
default-created by compiler from that of timed_out_error, which is
in turn default-created from the one of `std::exception`. and
`std::exception::exception` does not throw. so it's safe to
mark this factory method `noexcept`.
with this specifier, we don't need to worry about the exception thrown
by it, and don't need to handle them if any in `seastar::semaphore`,
where `timeout()` is called for the customized exception.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12759
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes#12645Closes#12646
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes#11508Closes#11536
Refs #11237
Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
Fixes#11184Fixes#11237
In prev (broken) fix for #11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.
This effectively prevents us from creating segments iff we have tight
limits. Since we nowadays do quite a bit of inserts _before_ commitlog
replay (system.local, but...) we can end up in a situation where we
deadlock start because we cannot get to the actual replay that will
eventually free things.
Another, not thought through, consequence is that we add a single
footprint to _all_ commitlog shard instances - even though only
shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.
Simplest fix is to add the footprint in delete call instead. This will
lock out segment creation until delete call is done, but this is fast.
Also ensures that only replay shard is involved.
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
Fixes#11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add include statements to satisfy dependencies.
Delete, now unneeded, include directives from the upper level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#11184
Not including it here can cause our estimate of "delete or not" after replay
to be skewed in favour of retaining segments as (new) recycles (or even flip
a counter), and if we have repeated crash+restarts we could be accumulating
an effectivly ever increasing segment footprint
Closes#11205
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.
This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.
We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.
Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.
Fixes#8272Fixes#8309Fixes#1459Closes#10333
* github.com:scylladb/scylla:
config: Introduce force_schema_commit_log option
config: Introduce unsafe_ignore_truncation_record
db: Avoid memtable flush latency on schema merge
db: Allow splitting initiatlization of system tables
db: Flush system.scylla_local on change
migration_manager: Do not drop system.IndexInfo on keyspace drop
Introduce SCHEMA_COMMITLOG cluster feature
frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
db/commitlog: Improve error messages in case of unknown column mapping
db/commitlog: Fix error format string to print the version
db: Introduce multi-table atomic apply()
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.
In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10972
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.
Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.
Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisiticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default persiod of 10s is something of an eternity,
so maybe that is superfluous...
Closes#10651
* github.com:scylladb/scylla:
commitlog: Add (internal) measurement of byte rates add/release/flush-req
commitlog: Add counters for # bytes released/flush requested
commitlog: Keep track of last flush high position to avoid double request
commitlog: Fix counter descriptor language
It happens in database::drain(), we know when it starts after keyspaces
are flushed, now it's good to know when it completes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#9367
The CL counters pending_allocations and requests_blocked_memory are
exposed in graphana (etc) and often referred to as metrics on whether
we are blocking on commit log. But they don't really show this, as
they only measure whether or not we are blocked on the memory bandwidth
semaphore that provides rate back pressure (fixed num bytes/s - sortof).
However, actual tasks in allocation or segment wait is not exposed, so
if we are blocked on disk IO or waiting for segments to become available,
we have no visible metrics.
While the "old" counters certainly are valid, I have yet to ever see them
be non-zero in modern life.
Closes#9368
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.
Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.
Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisiticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default period of 10s is something of an eternity,
so maybe that is superfluous...
Adds "bytes_released" and "bytes_flush_requested", representing
total bytes released from disk as a result of segment release
(as allocation bytes + overhead - not counting unused "waste"),
resp. total size we've requested flush callbacks to release data,
also counted as actual used bytes in segments we request be made
released.
These counters, together with bytes_written, should in ideal use
cases be at an equilibrium (actually equal), thus observing them
should give an idea on whether we are imbalanced in managing to
release bytes in same rate as they are allocated (i.e. transaction
rate).
Apparent mismerge or something. We already have an unused "_flush_position",
intended to keep track of the last requested high rp.
Now actually update and use it. The latter to avoid sending requests for
segments/cf id:s we've already requested external flush of. Also enables
us to ensure we don't do double bookkeep here.
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurences (i.e. caller is a coroutine).
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes#10699
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes#10562
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respecive
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10430
Saves a continuation. That matters very little. But...
Uses a special awaiter type on returns from the "then(...)"-wrapping
named_file methods (which use a then([...update]) to keep internal
size counters up-to-date, making the continuation instead a stored func
into the returned awaiter, executed on successul resume of the caller
co_await.
Changes delete/close queue, as well as deletetion queue into one, using
named_file objects + marker. Recycle list now also contains said named
file type.
This removes the need to re-eval file sizes on disk when deleting etc,
which in turn means we can dispose of recalculate_footprint on errors,
thus making things simpler and safer.