This templates the code for listener_vector, renames it to
atomic_vector and moves it to the utils directory.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
Most of the code in `cell` and the `imr` infrastructure it is built on
is `noexcept`. This means that extra care must be taken to avoid rouge
exceptions as they will bring down the node. The changes introduced by
0a453e5d3a did just that - introduced rouge `std::bad_alloc` into this
code path by violating an undocumented and unvalidated assumption --
that fragment ranges passed to `cell::make_collection()` are nothrow
copyable and movable.
This series refactors `cell::make_collection()` such that it does not
have this assumption anymore and is safe to use with any range.
Note that the unit test included in this series, that was used to find
all the possible exception sources will not be currently run in any of
our build modes, due to `SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION` not
being set. I plan to address this in a followup because setting this
flags fails other tests using the failure injection mechanism. This is
because these tests are normally run with the failure injection disabled
so failures managed to lurk in without anyone noticing.
Fixes: #5575
Refs: #5341
Tests: unit(dev, debug)
"
* 'data-cell-make-collection-exception-safety/v2' of https://github.com/denesb/scylla:
test: mutation_test: add exception safety test for large collection serialization
data/cell.hh: avoid accidental copies of non-nothrow copiable ranges
utils/fragment_range.hh: introduce fragment_range_view
A lightweight, trivially copyable and movable view for fragment ranges.
Allows for uniform treatment of all kinds of ranges, i.e. treating all
of them as a view. Currently `fragment_range.hh` provides lightweight,
view-like adaptors for empty and single-fragment ranges (`bytes_view`). To
allow code to treat owning multi-fragment ranges the shame way as the
former two, we need a view for the latter as well -- this is
`fragment_range_view`.
"
The original fix (10f6b125c8) didn't
take into account that if there was a failed memtable flush (Refs
flush) but is not a flushable memtable because it's not the latest in
the memtable list. If that happens, it means no other memtable is
flushable as well, cause otherwise it would be picked due to
evictable_occupancy(). Therefore the right action is to not flush
anything in this case.
Suspected to be observed in #4982. I didn't manage to reproduce after
triggering a failed memtable flush.
Fixes#3717
"
* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
database: Avoid OOMing with flush continuations after failed memtable flush
lsa: Introduce operator bool() to occupancy_stats
lsa: Expose region_impl::evictable_occupancy in the region class
Different versions of boost have different rules for what conversions
from cpp_int to smaller intergers are allowed.
We already had a function that worked with all supported versions, but
it was not being use by lua.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200104041028.215153-1-espindola@scylladb.com>
If any two directories of data/commitlog/hints/view_hints
are the same we still end up running verify_owner_and_mode
and disk_sanity(check_direct_io_support) in parallel
on the same directoriea and hit #5510.
This change uses std::set rather than std::vector to
collect a unique set of directories that need initialization.
Fixes#5510
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191225160645.2051184-1-bhalevy@scylladb.com>
The hints and view_hints directory has per-shard sub-dirs,
and the directories code tries to create, check and lock
all of them, including the base one.
The manipulations in question are excessive -- it's enough
to check and lock either the base dir, or all the per-shard
ones, but not everything. Let's take the latter approach for
its simplicity.
Fixes#5510
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks-good-to: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223142429.28448-1-xemul@scylladb.com>
The unordered_set is turned into vector since for fs::path
there's no hash() method that's needed for set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the only future-able operation remained is the call to
parallel_for_each(), all the rest is non-blocking preparation,
so we can drop the seastar::async and just return the future
from parallel_for_each.
The indendation is now good, as in previous patch is was prepared
just for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to drop the seastar::async() usage.
Currently we have two places that return futures -- calls to
parallel_for_each-s. We can either chain them together or,
since both are working on the same set of directories, chain
actions inside them.
For code simplicity I propose to chain actions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The list of paths that should be touch-and-locked is already
at hands, this shortens the code and makes it slightly faster
(in theory).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order not to pollute the root dir place the code in
utils/ directory, "utils" namespace.
While doing this -- move the touch_and_lock from the
class declaration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main.cc code that converts sstring to fs::path
will be patched soon, the file_desc::open belongs
to seastar and works on sstrings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But the even in the worst case the size of the largest
allocation will be less or equal compared to the case where the entire
buffer is linearized.
This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
Not just bytes::output_iterator. Allow writing into streams other than
just `bytes`. In fact we should be very careful with writing into
`bytes` as they require potentially large contiguous allocations.
The `write()` method is now templatized also on the type of its first
argument, which now accepts any CharOutputIterator. Due to our poor
usage of namespace this now collides with `write` defined inside
`db/commitlog/commitlog.cc`. Luckily, the latter doesn't really have to
be templatized on the data type it reads from, and de-templatizing it
resolves the clash.
Those are typically symptoms of use-after-free or memory corruption in
the program. It's better to catch such error sooner than later.
That situation is also dangerous since if a valid descriptor would
land under the invalid access, not the one which was intended for the
operation, then the operation may be performed on the wrong file and
result in corruption.
Message-Id: <1565206788-31254-1-git-send-email-tgrabiec@scylladb.com>
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
Merged patch series from Avi Kivity:
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by allocating it as an external object rather
than inlined into mutation_partition. This adds overhead when the
static row is present (17 bytes for the reference, back reference,
and lsa allocator overhead).
perf_simple_query appears to marginally (2%) faster. Footprint is
reduced by ~9% for a cache entry, 12% in memtables. More details are
provided in the patch commitlog.
Tests: unit (debug)
Avi Kivity (4):
managed_ref: add get() accessor
managed_ref: add external_memory_usage()
mutation_partition: introduce lazy_row
mutation_partition: make static_row optional to reduce memory
footprint
cell_locking.hh | 2 +-
converting_mutation_partition_applier.hh | 4 +-
mutation_partition.hh | 284 ++++++++++++++++++++++-
partition_builder.hh | 4 +-
utils/managed_ref.hh | 12 +
flat_mutation_reader.cc | 2 +-
memtable.cc | 2 +-
mutation_partition.cc | 45 +++-
mutation_partition_serializer.cc | 2 +-
partition_version.cc | 4 +-
tests/multishard_mutation_query_test.cc | 2 +-
tests/mutation_source_test.cc | 2 +-
tests/mutation_test.cc | 12 +-
tests/sstable_mutation_test.cc | 10 +-
14 files changed, 355 insertions(+), 32 deletions(-)
While a managed_ref emulates a reference more closely than it does
a pointer, it is still nullable, so add a get() (similar to
unique_ptr::get()) that can be nullptr if the reference is null.
The immediate use will be mutation_partition::_static_row, which
is often empty and takes up about 10% of a cache entry.
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
Match SQL's LIKE in allowing an empty pattern, which matches only
an empty text field.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This simplifies the debug implementation and it now should work with
scylla-gdb.py.
It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
Merged patch series from Avi Kivity:
In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.
Fixes#4780.
Tests: unit (dev)
Commit Summary
utils: chunked_vector: make begin()/end() const correct
utils::chunked_vector: add rbegin() and related iterators
reconcilable_result: use chunked_vector to hold partitions
The FIXME was added in the very first commit ("utils: Convert
utils/FBUtilities.java") that introduced the fb_utilities class as a
stub. However, we have long implemented the parts that we actually use,
so drop the FIXME as obsolete. In addition, drop the remaining
uncommented Java code as unused and also obsolete.
Message-Id: <20190808182758.1155-1-penberg@scylladb.com>
The handler is intended to be called when internal invariants are
violated and the operation cannot safely continue. The handler either
throws (default) or aborts, depending on configuration option.
Passing --abort-on-internal-error on the command line will switch to
aborting.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.
This slipped through since iterator's constructor does a const_cast.
Noticed by code inspection.
Fixes#4713
Modifying config files to use sharded storage misses the fact
that extensions are allowed to add non-member config fields to
the main configuration, typically from "extra" config_file
objects.
Unless those "extra" files are broadcast when main file broadcast,
the values will not be readable from other shards.
This patch propagates the broadcast to all other config files
whose entries are in the top level object. This ensures we
always keep data up to date on config reload.
Message-Id: <20190715135851.19948-1-calle@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy objects that covers the end of
the segment being closed. However, it the last real objects ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference address at the end of the segment. This patch fixes that.
Fixes#4653.
The get_value method returns a pointer to the value that is used by the
value_to_json method.
The assumption is that the void pointer points to the actual value.
Fixes#4678
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patchset allows changing the configuration at runtime, The user
triggers this by editing the configuration file normally, then
signalling the database with SIGHUP (as is traditional).
The implementation is somewhat complicated due the need to store
non-atomic mutable state per-shard and to synchronize the values in
all shards. This is somewhat similar to Seastar's sharded<>, but that
cannot be used since the configuration is read before Seastar is
initialized (due to the need to read command-line options).
Tests: unit (dev, debug), manual test with extra prints (dev)
Ref #2689Fixes#2517.
Replace the per-shard value we store with an updateable_value_source, which
allows updating it dynamically and allows users to track changes.
The broadcast_to_all_shards() function is augmented to apply modifications
when called on a live system.
Since some of our values are not atomic (strings) and the administrative
information needed to track references to values is also not atomic, we will
need to store them per-shard. To do that we add a vector of per-shard data
to config_file, where each element is itself a vector of configuration items.
Since we need to operate generically on items (copying them from shard to shard)
we store them in a type-erased form.
Only mutable state is stored per-shard.
The updateable_value and updateable_value_source classes allow broadcasting
configuration changes across the application. The updateable_value_source class
represents a value that can be updated, and updateable_value tracks its source
and reflects changes. A typical use replaces "uint64_t config_item" with
"updateable_value<uint64_t> config_item", and from now on changes to the source
will be reflected in config_item. For more complicated uses, which must run some
callback when configuration changes, you can also call
config_item.observe(callback) to be actively notified of changes.
Currently, we allow adjusting configuration via
cfg.whatever() = 5;
by returning a mutable reference from cfg.whatever(). Soon, however, this operation
will have side effects (updating all references to the config item, and triggering
notifiers). While this can be done with a proxy, it is too tricky.
Switch to an ordinary setter interface:
cfg.whatever.set(5);
Because boost::program_options no longer gets a reference to the value to be written
to, we have to move the update to a notifier, and the value_ex() function has to
be adjusted to infer whether it was called with a vector type after it is
called, not before.
config_file and db::config are soon not going to be copyable. The reason is that
in order to support live updating, we'll need per-shard copies of each value,
and per-shard tracking of references to values. While these can be copied, it
will be an asycnronous operation and thus cannot be done from a copy constructor.
So to prepare for these changes, replace all copies of db::config by references
and delete config_file's copy constructor.
Some existing references had to be made const in order to adapt the const-ness
of db::config now being propagated (rather than being terminated by a non-const
copy).