Column_family_test allows performing private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.
The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods(and some extra) without majorly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.
This causes most methods of the version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with its
descendants.
The strategy used for deciding when and with which of its children
should a version be merged heavily depends on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It's worth to mention, that when a copy is made, the copied set should not
be modified anymore, because it would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version assiociated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions(with
changes from the table), and smaller branches of versions that start
from a version from this chain, but are deleted soon after.
In such case we can merge a version when it has exactly one descendant,
as this limits the number of concurrent ancestors of a version to the
number of copies of its ancestors are concurrently used. During each
such merge, the parent version is removed and the child version is
modified so that all operations on it give the same results.
In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains a sstable_list but not every sstable_list has to be contained
by a sstable_set, and we also want to allow fast copying of sstable_lists,
so the reference to the sstable_set_version is kept by the sstable_lists
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.
Accessing sstables that are elements of a certain sstable_set copy(so
the select, select_sstable_runs and sstable_list's iterator) get results
from containers that hold all sstables from all versions(which are stored
in a single, shared "versioned_sstable_set_data" structure), and then
filter out these sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added, and erased sstables (an sstable can
still be added in one and erased in the second). It's worth noting hat if
an sstable has been added in one of the merged sets and erased in the
second, the version that remains after merging doesn't need to have any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter if the sstable has
been erased before or after being added).
To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, in how many different verisions
has it been added. When a change that adds an sstable gets merged with a change
that removes it, or when a such a change simply gets deleted alongside its
associated version, this count is reduced, and when an sstable gets added to a
version that doesn't already contain it, this count is increased.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
Fixes#2622
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may be already deleted inside the loop too.
Fix by creating a local copy of the lw_shared_ptr, and get reference from it
in the loop.
Fixes#7605
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Adding an non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.
Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.
However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.
This series separates the register_inactive_reader interface
into 2 parts:
1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and return an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fance at this point
that will require cleanup.
This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrecy_semaphore
runs out of resources.
2. After register_inactive_reader succeeded to return a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the o_r_h.
After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.
inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object. The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.
Test: unit(release), querier_cache_test(debug)
"
* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
querier_cache: insert_querier: ignore errors to register inactive reader
querier_cache: insert_querier: handle errors
querier_utils: mark functions noexcept
reader_concurrency_semaphore: register_inactive_read: make noexcept
reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
reader_concurrency_semaphore: inactive_read: use intrusive list
reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
reader_concurrency_semaphore: inactive_read_handle: swap definition order
reader_lifecycle_policy: retire low level try_resume method
reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes#8062Closes#8063
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297
Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.
This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.
Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed excepted failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
`
Fixes#7294Closes#8039
* github.com:scylladb/scylla:
alternator: server: return api_error instead of throwing
alternator: add requests_shed metrics
alternator: add handling max_concurrent_requests_per_shard
alternator: add RequestLimitExceeded error
The code that creates system keyspace open code a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
The error is benign but if it is not handled "unhandled exception" error
will be printed in the logs.
Message-Id: <20210209150313.GA1708015@scylladb.com>
The interface of the failure detector service is cleaned up a little:
- an unimplemented method is removed (is_alive)
- a return type of another method is fixed (arrival_samples)
- a getter for the most recent successful update is added (last_update)
This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.
Closes#8052
* github.com:scylladb/scylla:
failure_detector: add getting last update time point
failure_detector: return arrival samples by const reference
failure_detector: remove unimplemented is_alive method
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)
The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.
Fixes#7595Closes#8056
* github.com:scylladb/scylla:
tests: Adjusted tests for DC checking in NTS
locator: Check DC names in NTS
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
* seastar 4c7c5c7c4...76cff5896 (6):
> rpc: Make is possible for rpc server instance to refuse connection
> reactor: expose cumulative tasks processed statistic
> fair_queue: add missing #include <optional>
> reactor: optimize need_preempt() thread-local-storage access
> Merge " Use reference for backend->reactor link" from Pavel E
> test: coroutines: failed coroutine does not throw
CQL test relied on quietly acceptiong non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
Since the reader may normally dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They all are trivially noexcept.
Mark them so to simplify error handing assumptions in the
next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Catch error to allocate an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Register the inactive reader first with no
evict_notify_handler and ttl.
Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to to place the querier in the index and erase it
on eviction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.
This saves an allocation and another potential
error case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify insertion and eviction into the inactive_reads container,
use an intrusive list thta requires a single allocation for the
inactive_read object itself.
This allows passing a reference to the inactive_read
to evict it.
Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).
It is also safe to evict the inactive_reader while the
i_r_h is alive. In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.
bi::auto_unlink will detect that it's alredy unlinked
when destroyed and do nothing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to lookup the inactive_read if the i_r_h
is disengaged, it should not be registered so just return
quickly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.
With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This fixes the problem where the cordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.
Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
On start the transport controller keeps the storage service
on server config's lambda just to let the server grab a
database config option.
The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance get local database with config.
As an nice side effect transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
Few method in column_familiy API were doing the aggregation wrong,
specifically, bloom filter disk size.
The issue is not always visible, it happens when there are multiple
filter files per shard.
Fixes#4513
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#8007
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.
The fix is to lookup the object one more time after maybe_commit().
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.