"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
(cherry picked from commit 1891779e64)
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.
In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.
Fixes#3817
"
* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
db::hints::manager: use "streaming" I/O scheduling class for reads
commitlog::read_log_file(): set the a read I/O priority class explicitly
db::hints::manager: add hints sender to the "streaming" CPU scheduling group
(cherry picked from commit 1533487ba8)
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
muation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, which is the standard
buffer size in Scylla and keep large values fragmented.
Buffer pools were added in 7191a130bb
"Commitlog: recycle buffers to reduce fragmentation." They introduce a
lot of complexity and will become unnecessary once the code is switched
to use fixed-size 128kB buffers.
If a node cannot allocate a 128 kB it is already in a very bad shape, so
there isn't much value in trying to recover by attempting smaller
allocations and it just adds more complexity to the segment allocation.
It actually may be better to let some requests fail and give the node a
chance to recover rather than trying to use every last byte of free
memory and end up with bad_alloc in a noexcept context.
Fixes#3446
Previously, only shutdown-synced objects where actually closed,
which is wrong.
This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.
Message-Id: <20180521140456.32100-1-calle@scylladb.com>
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.
The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
Refs #2858
Push segement files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indescrimenately do deletes in non-anchored tasks, because we need
to guarantee that finshed segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
Operations on a segment's underlying append_challenged_posix_file_impl,
such as truncate(), schedule asynchronous operations when they are
executed, which capture the file object. To synchronize with them and
prevent use-after-free, we need to call close() and only delete the
segment and file when the returned future resolves.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216235754.24257-1-duarte@scylladb.com>
When shutting down the commitlog we try to block all new requests by
acquiring all available resources. We were, however, letting go of the
semaphore permits too early, before closing the gate and shutting down
the active segments.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234826.24111-1-duarte@scylladb.com>
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.
So this patch adds a default value for this parameter, as was suggested
by the original commit message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This parameter is used when creating a new segment.
It's default value is a descriptor::FILENAME_PREFIX.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endiannes.
For instance
process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.
process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
So that they are not left on disk even though we did a clean shutdown.
First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose position are cleared can be
removed.
Fixes#2550.
Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
Currently commitlog_entry_writer constructor calculates serialized size
before it is knows if a schema should be included into the entry. The
result is never used since it is recalculated when schema information is
supplied. The patch removes needless calculation.
Message-Id: <20170614114607.GA21915@scylladb.com>
Use per CF-id reference count instead, and use handles as result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.
Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.
This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.
Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.
I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.
v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
Fixes#2173
Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.
Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.
Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
Fixes#2098
Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.
v2:
* Fixed whitespace errors
Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.
This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.
Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
Do not ignore a future<> retuned by cycle() since it will produce a
warning in case of an error. Log it instead.
Message-Id: <20170219151811.GN11471@scylladb.com>
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.
This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
Refs #1813 (fixes scylla part)
Added require_client_auth and priority_string options to
server_encryption_options/client_encryption_options an process them.
Allows TLS method/algo specification. Also enabled enforcing known cert
authentication for both node-to-node and client communication.
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
The per-node limit will be total memory divided by number of shards
instead of just total memory. For example, when Scylla is started with
-c16 -m16G, the commit log will induce flushes on given shard when
unflushed data exceeds on that shard 62MB instead of 1GB.
Fixes#2046.
Message-Id: <1485874534-10939-1-git-send-email-tgrabiec@scylladb.com>
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.
During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.
Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.
Fixes#1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.
That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.
This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.
The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.
Introduced in abe7358767.
dtest failure:
commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.
The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.
That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished. The
following sequence of events with its deferring points will ilustrate
it:
on_timer:
return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {
At this point, the segment id is already allocated.
new_segment():
if (_reserve_segments.empty()) {
[ ... ]
return allocate_segment(true).then ...
At this point, we have a new segment that has an id that is higher than
the previous id allocated.
Then we resume the execution from the deferring point in on_timer():
i = _reserve_segments.emplace(i, std::move(s));
The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.
Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.
This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).
Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>