If any two directories of data/commitlog/hints/view_hints
are the same we still end up running verify_owner_and_mode
and disk_sanity(check_direct_io_support) in parallel
on the same directoriea and hit #5510.
This change uses std::set rather than std::vector to
collect a unique set of directories that need initialization.
Fixes#5510
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191225160645.2051184-1-bhalevy@scylladb.com>
The hints and view_hints directory has per-shard sub-dirs,
and the directories code tries to create, check and lock
all of them, including the base one.
The manipulations in question are excessive -- it's enough
to check and lock either the base dir, or all the per-shard
ones, but not everything. Let's take the latter approach for
its simplicity.
Fixes#5510
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks-good-to: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191223142429.28448-1-xemul@scylladb.com>
The unordered_set is turned into vector since for fs::path
there's no hash() method that's needed for set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the only future-able operation remained is the call to
parallel_for_each(), all the rest is non-blocking preparation,
so we can drop the seastar::async and just return the future
from parallel_for_each.
The indendation is now good, as in previous patch is was prepared
just for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to drop the seastar::async() usage.
Currently we have two places that return futures -- calls to
parallel_for_each-s. We can either chain them together or,
since both are working on the same set of directories, chain
actions inside them.
For code simplicity I propose to chain actions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The list of paths that should be touch-and-locked is already
at hands, this shortens the code and makes it slightly faster
(in theory).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order not to pollute the root dir place the code in
utils/ directory, "utils" namespace.
While doing this -- move the touch_and_lock from the
class declaration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main.cc code that converts sstring to fs::path
will be patched soon, the file_desc::open belongs
to seastar and works on sstrings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`segment_manager' now uses a decorated version of `timed_out_error'
with hardcoded name. On the other hand `region_group' uses named
`on_request_expiry' within its `expiring_fifo'.
`linearizing_input_stream` allows transparently reading linearized
values from a fragmented buffer. This is done by linearizing on-the-fly
only those read values that happen to be split across multiple
fragments. This reduces the size of the largest allocation from the size
of the entire buffer (when the entire buffer is linearized) to the size
of the largest read value. This is a huge gain when the buffer contains
loads of small objects, and modest gains when the buffer contains few
large objects. But the even in the worst case the size of the largest
allocation will be less or equal compared to the case where the entire
buffer is linearized.
This stream is planned to be used as glue code between the fragmented
cell value and the collection deserialization code which expects to be
reading linearized values.
Not just bytes::output_iterator. Allow writing into streams other than
just `bytes`. In fact we should be very careful with writing into
`bytes` as they require potentially large contiguous allocations.
The `write()` method is now templatized also on the type of its first
argument, which now accepts any CharOutputIterator. Due to our poor
usage of namespace this now collides with `write` defined inside
`db/commitlog/commitlog.cc`. Luckily, the latter doesn't really have to
be templatized on the data type it reads from, and de-templatizing it
resolves the clash.
Those are typically symptoms of use-after-free or memory corruption in
the program. It's better to catch such error sooner than later.
That situation is also dangerous since if a valid descriptor would
land under the invalid access, not the one which was intended for the
operation, then the operation may be performed on the wrong file and
result in corruption.
Message-Id: <1565206788-31254-1-git-send-email-tgrabiec@scylladb.com>
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
Merged patch series from Avi Kivity:
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by allocating it as an external object rather
than inlined into mutation_partition. This adds overhead when the
static row is present (17 bytes for the reference, back reference,
and lsa allocator overhead).
perf_simple_query appears to marginally (2%) faster. Footprint is
reduced by ~9% for a cache entry, 12% in memtables. More details are
provided in the patch commitlog.
Tests: unit (debug)
Avi Kivity (4):
managed_ref: add get() accessor
managed_ref: add external_memory_usage()
mutation_partition: introduce lazy_row
mutation_partition: make static_row optional to reduce memory
footprint
cell_locking.hh | 2 +-
converting_mutation_partition_applier.hh | 4 +-
mutation_partition.hh | 284 ++++++++++++++++++++++-
partition_builder.hh | 4 +-
utils/managed_ref.hh | 12 +
flat_mutation_reader.cc | 2 +-
memtable.cc | 2 +-
mutation_partition.cc | 45 +++-
mutation_partition_serializer.cc | 2 +-
partition_version.cc | 4 +-
tests/multishard_mutation_query_test.cc | 2 +-
tests/mutation_source_test.cc | 2 +-
tests/mutation_test.cc | 12 +-
tests/sstable_mutation_test.cc | 10 +-
14 files changed, 355 insertions(+), 32 deletions(-)
While a managed_ref emulates a reference more closely than it does
a pointer, it is still nullable, so add a get() (similar to
unique_ptr::get()) that can be nullptr if the reference is null.
The immediate use will be mutation_partition::_static_row, which
is often empty and takes up about 10% of a cache entry.
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
Match SQL's LIKE in allowing an empty pattern, which matches only
an empty text field.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This simplifies the debug implementation and it now should work with
scylla-gdb.py.
It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
Merged patch series from Avi Kivity:
In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.
Fixes#4780.
Tests: unit (dev)
Commit Summary
utils: chunked_vector: make begin()/end() const correct
utils::chunked_vector: add rbegin() and related iterators
reconcilable_result: use chunked_vector to hold partitions
The FIXME was added in the very first commit ("utils: Convert
utils/FBUtilities.java") that introduced the fb_utilities class as a
stub. However, we have long implemented the parts that we actually use,
so drop the FIXME as obsolete. In addition, drop the remaining
uncommented Java code as unused and also obsolete.
Message-Id: <20190808182758.1155-1-penberg@scylladb.com>
The handler is intended to be called when internal invariants are
violated and the operation cannot safely continue. The handler either
throws (default) or aborts, depending on configuration option.
Passing --abort-on-internal-error on the command line will switch to
aborting.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.
This slipped through since iterator's constructor does a const_cast.
Noticed by code inspection.
Fixes#4713
Modifying config files to use sharded storage misses the fact
that extensions are allowed to add non-member config fields to
the main configuration, typically from "extra" config_file
objects.
Unless those "extra" files are broadcast when main file broadcast,
the values will not be readable from other shards.
This patch propagates the broadcast to all other config files
whose entries are in the top level object. This ensures we
always keep data up to date on config reload.
Message-Id: <20190715135851.19948-1-calle@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy objects that covers the end of
the segment being closed. However, it the last real objects ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference address at the end of the segment. This patch fixes that.
Fixes#4653.
The get_value method returns a pointer to the value that is used by the
value_to_json method.
The assumption is that the void pointer points to the actual value.
Fixes#4678
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patchset allows changing the configuration at runtime, The user
triggers this by editing the configuration file normally, then
signalling the database with SIGHUP (as is traditional).
The implementation is somewhat complicated due the need to store
non-atomic mutable state per-shard and to synchronize the values in
all shards. This is somewhat similar to Seastar's sharded<>, but that
cannot be used since the configuration is read before Seastar is
initialized (due to the need to read command-line options).
Tests: unit (dev, debug), manual test with extra prints (dev)
Ref #2689Fixes#2517.
Replace the per-shard value we store with an updateable_value_source, which
allows updating it dynamically and allows users to track changes.
The broadcast_to_all_shards() function is augmented to apply modifications
when called on a live system.
Since some of our values are not atomic (strings) and the administrative
information needed to track references to values is also not atomic, we will
need to store them per-shard. To do that we add a vector of per-shard data
to config_file, where each element is itself a vector of configuration items.
Since we need to operate generically on items (copying them from shard to shard)
we store them in a type-erased form.
Only mutable state is stored per-shard.
The updateable_value and updateable_value_source classes allow broadcasting
configuration changes across the application. The updateable_value_source class
represents a value that can be updated, and updateable_value tracks its source
and reflects changes. A typical use replaces "uint64_t config_item" with
"updateable_value<uint64_t> config_item", and from now on changes to the source
will be reflected in config_item. For more complicated uses, which must run some
callback when configuration changes, you can also call
config_item.observe(callback) to be actively notified of changes.
Currently, we allow adjusting configuration via
cfg.whatever() = 5;
by returning a mutable reference from cfg.whatever(). Soon, however, this operation
will have side effects (updating all references to the config item, and triggering
notifiers). While this can be done with a proxy, it is too tricky.
Switch to an ordinary setter interface:
cfg.whatever.set(5);
Because boost::program_options no longer gets a reference to the value to be written
to, we have to move the update to a notifier, and the value_ex() function has to
be adjusted to infer whether it was called with a vector type after it is
called, not before.
config_file and db::config are soon not going to be copyable. The reason is that
in order to support live updating, we'll need per-shard copies of each value,
and per-shard tracking of references to values. While these can be copied, it
will be an asycnronous operation and thus cannot be done from a copy constructor.
So to prepare for these changes, replace all copies of db::config by references
and delete config_file's copy constructor.
Some existing references had to be made const in order to adapt the const-ness
of db::config now being propagated (rather than being terminated by a non-const
copy).
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
pow2_rank is undefined for 0.
bucket_of currently works around that by using a bitmask of 0.
To allow asserting that count_{leading,trailing}_zeros are not
called with 0, we want to avoid it at all call sites.
Fixes#4153
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190623162137.2401-1-bhalevy@scylladb.com>
With this patch, when using asan, we poison segment memory that has
been allocated from the system but should not be accessible to user
code.
Should help with debugging user after free bugs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190607140313.5988-1-espindola@scylladb.com>
A lot of code in scylla is only reachable if SEASTAR_DEFAULT_ALLOCATOR
is not defined. In particular, refill_emergency_reserve in the default
allocator case is empty, but in the seastar allocator case it compacts
segments.
I am trying to debug a crash that seems to involve memory corruption
around the lsa allocator, and being able to use a debug build for that
would be awesome.
This patch reduces the differences between the two cases by having a
common segment_pool that defers only a few operations to different
segment_store implementations.
Tests: unit (debug, dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190606020937.118205-1-espindola@scylladb.com>