Split update_schema_version_and_announce() into
update_schema_version() and announce_schema_version(). This is going to
be used in storage_service::prepare_to_join(), where we want to first
update the schema version, then start gossip, and only then announce the
schema version.
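A rough sketch of the intended ordering (only the two split functions
come from this patch; the surrounding names are illustrative):

    #include <seastar/core/future.hh>
    using namespace seastar;

    // Hypothetical sketch of prepare_to_join(); start_gossiping() is a
    // stand-in for whatever starts the gossip service.
    struct storage_service {
        future<> update_schema_version();    // persist the version locally
        future<> start_gossiping();          // stand-in for gossip startup
        future<> announce_schema_version();  // publish the version to peers

        future<> prepare_to_join() {
            return update_schema_version().then([this] {
                return start_gossiping();
            }).then([this] {
                return announce_schema_version();
            });
        }
    };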
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be an order in which we can stop the existing
services in cql_test_env that avoids this.
This patch then adds another step when shutting down a database: first
stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction, or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad, since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
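A minimal sketch of the extra shutdown step, assuming the handler
exposes a stop flag (all names here are illustrative):

    #include <seastar/core/future.hh>
    using namespace seastar;

    // Once stopped, the handler becomes a no-op, so late flushes,
    // compactions and sstable deletions skip the system table.
    class large_partition_handler {
        bool _stopped = false;
    public:
        void stop() { _stopped = true; }
        future<> maybe_update_large_partitions(/* partition stats */) {
            if (_stopped) {
                return make_ready_future<>(); // shutting down: drop it
            }
            return update_system_table();
        }
    private:
        future<> update_system_table();
    };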
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
Previously it was the responsibility of the layer above (multishard
combining reader) to pause readers, which happened via an explicit
`pause()` call. This proved to be a very bad design as we kept finding
spots where the multishard reader should have paused the reader to avoid
potential deadlocks (due to starved reader concurrency semaphores), but
didn't.
This commit moves the responsibility of pausing the reader into the
shard reader. The reader is now kept in a paused state, except when it
is actually used (a `fill_buffer()` or `fast_forward_to()` call is
executing). This is fully transparent to the layer above.
As a side note, the shard reader now also hides when the reader is
created. This also used to be the responsibility of the multishard
reader, and although it caused no problems so far, it can be considered
a leak of internal details. The shard reader now automatically creates
the remote reader the first time it is used.
The code has been reorganized, such that there is now a clear separation
of responsibilities. The multishard combining reader handles the
combining of the output of the shard readers, as well as issuing
read-aheads. The shard reader handles read-ahead and creating the
remote reader when needed, as well as transferring the results of remote
reads to the "home" shard. The remote reader
(`shard_reader::remote_reader`, new in this patch) handles
pausing-resuming as well as recreating the reader after it was evicted.
Layers don't access each other's internals (like they used to).
After this commit, the reader passed to `destroy_reader()` will always
be in paused state.
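The discipline, as a sketch (helper names are illustrative; the real
code differs in detail):

    #include <seastar/core/future.hh>
    using namespace seastar;

    class shard_reader {
        future<> resume();          // take the reader out of paused state
        void pause();               // make it evictable again
        future<> do_fill_buffer();  // the actual remote work
    public:
        future<> fill_buffer() {
            return resume().then([this] {
                return do_fill_buffer();
            }).finally([this] {
                // Paused whenever no fill_buffer()/fast_forward_to()
                // call is executing.
                pause();
            });
        }
    };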
Reader creation happens through the `reader_lifecycle_policy` interface,
which offers a `create_reader()` method. This method accepts a shard
parameter (among others) and returns a future. Its implementation is
expected to go to the specified shard and then return with the created
reader. The method is expected to be called from the shard where the
shard reader (and consequently the multishard reader) lives. This API,
while reasonable enough, has a serious flaw: it doesn't make batching
possible. For example, if the shard reader issues a call to the remote
shard to fill the remote reader's buffer, but finds that the reader was
evicted while paused, it has to come back to the local shard just to
issue the recreate call. This makes the code both convoluted and slow.
Change the reader creation API to be synchronous, that is, callable from
the shard where the reader has to be created, allowing for simple call
sites and batching.
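As a sketch (types are stand-ins, and cross-shard ownership is heavily
simplified), the synchronous API lets the recreate and the fill share a
single cross-shard hop:

    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <memory>
    using namespace seastar;

    struct reader { future<> fill_buffer(); }; // stand-in type
    struct reader_lifecycle_policy {
        // Now synchronous: called on the shard the reader will live on.
        virtual std::unique_ptr<reader> create_reader() = 0;
        virtual ~reader_lifecycle_policy() = default;
    };

    // Recreating an evicted reader and filling its buffer in one hop.
    future<> fill_on_shard(unsigned shard, reader_lifecycle_policy& policy,
                           std::unique_ptr<reader>& remote_reader) {
        return smp::submit_to(shard, [&] {
            if (!remote_reader) {
                remote_reader = policy.create_reader();
            }
            return remote_reader->fill_buffer();
        });
    }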
This change requires that implementations of the lifecycle policy update
any per-reader data-structure they have from the remote shard. This is
not a problem however, as these data-structures are usually partitioned,
such that they can be accessed safely from a remote shard.
Another very pleasant consequence of this change is that now all
methods of the lifecycle interface are synchronous, and thus calls to
them cannot overlap anymore.
This patch also removes the
`test_multishard_combining_reader_destroyed_with_pending_create_reader`
unit test, which is not useful anymore.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize
noise.
The shard reader relies on the `reader_lifecycle_policy` for pausing and
resuming the remote reader. The lifecycle policy's API was designed to
be as general as possible, allowing for any implementation of
pause/resume. However, in practice, we have a single implementation of
pause/resume: registering/unregistering the reader with the relevant
`reader_concurrency_semaphore`, and we don't expect any new
implementations to appear in the future.
Thus, the generic API of the lifecycle policy is needlessly abstract,
making its implementations needlessly complex. We can instead make this
very concrete and have the lifecycle policy just return the relevant
semaphore, removing the need for every implementor of the lifecycle
policy interface to have a duplicate implementation of the very same
logic.
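As a sketch, the interface change amounts to this (real signatures may
differ in detail):

    class reader_concurrency_semaphore; // defined elsewhere

    // Before (sketch): every implementor duplicated pause/resume logic:
    //     virtual void pause(reader r) = 0;
    //     virtual std::optional<reader> try_resume() = 0;
    //
    // After (sketch): the policy merely names the semaphore; the shard
    // reader registers/unregisters the inactive reader itself.
    struct reader_lifecycle_policy {
        virtual reader_concurrency_semaphore& semaphore() = 0;
        virtual ~reader_lifecycle_policy() = default;
    };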
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize noise.
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
is there a sane way to figure out, by looking at the data, which one is
in use.
For safety, we made the decision long ago not to import any tables
with counters: if a counter was generated in older Cassandra, we would
misrepresent it.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old; many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is either better, faster, or, worse, the only option left. I
argue that having a flag that allows us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
"
This series prevents view building from falling back to storing hints.
Instead, it will try to send view updates to an endpoint as if at
consistency level ONE, and in case of failure retry the whole
building step. As a result, view building will never be marked as
finished prematurely (because of pending hints), which will help avoid
creating inconsistencies when decommissioning a node from the cluster.
Tests:
unit (release)
dtest (materialized_views_test.py.*)
Fixes #3857
Fixes #4039
"
* 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla:
db,view: add updating view_building_paused statistics
database: add view_building_paused metrics
table: make populate_views not allow hints
db,view: add allow_hints parameter to mutate_MV
storage_proxy: add allow_hints parameter to send_to_endpoint
This commit declares shared_ptr<user_types_metadata> in
database.hh, where user_types_metadata is an incomplete type, so it
requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently nop_large_partition_handler is only used in tests, but it
can also be used to avoid self-reporting.
Tests: unit(Release)
I also tested starting scylla with
--compaction-large-partition-warning-threshold-mb=0.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190123205059.39573-1-espindola@scylladb.com>
Recently we had a bug (#4096) due to a component
(`multishard_mutation_query()`) assuming that all reads used the
semaphore obtainable via `database::user_read_concurrency_sem()`.
This problem revealed that it is plain wrong to allow access to the
shard-global semaphores residing in the database object. Instead, all
code wishing to access the relevant semaphore for some read should do
so via the relevant `table` object, thus guaranteeing that it will get
the correct semaphore, configured for that table.
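The intended access pattern, as a sketch (the table accessor name here
is hypothetical):

    // Wrong: assumes all reads use the shard-global "user" semaphore.
    //     auto& sem = db.user_read_concurrency_sem();
    //
    // Right: ask the table, which knows which semaphore its reads are
    // configured with:
    //     auto& sem = tbl.read_concurrency_semaphore(); // hypothetical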
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f3a6780eb3240822db34aba7c1ba0a675a96592.1547734212.git.bdenes@scylladb.com>
This long slow-path function is called four times, so de-templating it is an
easy win. We use std::function instead of noncopyable_function because the
function is copied within the parallel_for_each callback. The original code
uses a move, which is incorrect, but did not fail because moving the lambdas
that were used as the actual arguments is equivalent to a copy.
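A reduced illustration of why the copy is needed (do_work is a made-up
stand-in for the real per-element operation):

    #include <seastar/core/loop.hh>
    #include <functional>
    #include <vector>
    using namespace seastar;

    future<> do_work(int item, std::function<future<>(int)> f);

    future<> visit_all(std::vector<int> items,
                       std::function<future<>(int)> f) {
        return parallel_for_each(std::move(items), [f] (int item) {
            // The callback runs once per element, so f must be copied
            // into each do_work() call; std::move(f) would leave the
            // capture empty after the first element. Moving a stateless
            // lambda happens to act like a copy, which hid the bug.
            return do_work(item, f);
        });
    }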
"
This mini-series adds counters for the inactive reads registered in the
reader concurrency semaphore.
"
* 'reader-concurrency-semaphore-counters/v6' of https://github.com/denesb/scylla:
tests/querier_cache: use stats to get the no. of inactive reads
reader_concurrency_semaphore: add counters for inactive reads
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
Implementation of the nodetool toppartitions query, which samples the most
frequent PKs in read/write operations over a period of time.
Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
interfaces with data_listeners and the REST api),
- REST api for toppartitions query.
Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).
What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).
Fixes #2811
* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
top_k: whitespace and minor fixes
top_k: map template arguments
top_k: std::list -> chunked_vector
top_k: support for appending top_k results
nodetool toppartitions: refactor table::config constructor
nodetool toppartitions: data listeners
nodetool toppartitions: add data_listeners to database/table
nodetool toppartitions: fully_qualified_cf_name
nodetool toppartitions: Toppartitions query implementation
nodetool toppartitions: Toppartitions query REST API
nodetool toppartitions: nodetool-toppartitions script
Add a data_listeners member to database.
Add a data_listeners* to table::config, to be used by table methods to
invoke listeners.
Install an on_read() listener in table::make_reader().
Install an on_write() listener in database::apply_in_memory().
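The listener interface, roughly (parameters elided; the real interface
may differ in detail):

    // Sketch of the hook points only.
    struct data_listener {
        // Invoked from table::make_reader() for each read.
        virtual void on_read(/* schema, range, slice, ... */) {}
        // Invoked from database::apply_in_memory() for each write.
        virtual void on_write(/* schema, mutation, ... */) {}
        virtual ~data_listener() = default;
    };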
Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
View building sends view updates synchronously, which has natural
backpressure. However, these updates
1) contribute to the load on the view replicas, and
2) add memory pressure to the base replica.
They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We no longer wait on the semaphore and instead over-subscribe it, so
there's no reason to pass a timeout.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode and we are failing to throttle them; consequently, we
should start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, or failing the base replica
write. The latter can fail the whole user write, or if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).
Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.
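In Seastar terms, a minimal sketch of that accounting (the byte-based
units are an assumption of this sketch):

    #include <seastar/core/semaphore.hh>
    using namespace seastar;

    // consume() grabs units immediately and never blocks; the count may
    // go negative (underflow). Since we stop generating new view updates
    // while underflowed, the deficit stays bounded.
    void account_pending_update(semaphore& view_update_sem, size_t bytes) {
        view_update_sem.consume(bytes);
        // ...view replicas acking the update later returns the units:
        // view_update_sem.signal(bytes);
    }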
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Propagate acquired semaphore units to mutate_MV() to allow the
semaphore to be incrementally signalled as view updates are processed
by view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Stopping a table can happen concurrently with in-flight reads and
writes, which rely on table state; we must therefore prevent the
table's destruction before those operations complete.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).
100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.
To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.
To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded, but also to allow it to
absorb bursts of writes resulting in large view mutations.
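As a sketch, sizing the semaphore in bytes rather than in number of
updates (assuming Seastar's memory::stats(); the 10% figure is from
this series):

    #include <seastar/core/semaphore.hh>
    #include <seastar/core/memory.hh>
    using namespace seastar;

    // One per shard; units are bytes of pending view-update memory.
    semaphore make_view_update_backlog_semaphore() {
        return semaphore(memory::stats().total_memory() / 10);
    }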
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Working in terms of frozen_mutations allows us to account more
precisely for the memory that pending view updates consume at the
storage_proxy layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows an std::move() in its body to work as intended. Also, make
the lambda's argument type explicit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This series optimises the read path by replacing some usages of
std::vector by utils::small_vector. The motivation for this change was
an observation that memory allocation functions are pointed out by the
profiler as the ones where we spent most time and while they have a
large number of callers storage allocation for some vectors was close to
the top. The gains are not huge, since the problem is a lot of things
adding up and not a single slow thing, but we need to start with
something.
Unfortunately, the performance of boost::container::small_vector is
quite disappointing, so a new implementation of a small_vector was
introduced (sketched after the commit list below).
perf_simple_query -c4 --duration 60, medians:
              ./perf_before   ./perf_after   diff
    read      343086.80       360720.53      5.1%
Tests: unit(release, small_vector in debug)
"
* tag 'small_vector/v2.1' of https://github.com/pdziepak/scylla:
partition_slice: use small_vector for column_ids
mutation_fragment_merger: use small_vector
auth: use small_vector in resource
auth: avoid list-initialisation of vectors
idl: serialiser: add serialiser for utils::small_vector
idl: serialiser: deduplicate vector serialisers
utils: introduce small_vector
intrusive_set_external_comparator: make iterator nothrow move constructible
mutation_fragment_merger: value-initialise iterator
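The core idea of such a small vector, as a heavily simplified sketch
(this is not the actual utils::small_vector; it is restricted to
trivially copyable types for brevity):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>
    #include <type_traits>

    template <typename T, std::size_t N>
    class small_vector_sketch {
        static_assert(std::is_trivially_copyable_v<T>);
        T _inline[N];            // inline storage: no allocation while
        T* _data = _inline;      // size() <= N, the common case
        std::size_t _size = 0;
        std::size_t _capacity = N;
    public:
        small_vector_sketch() = default;
        small_vector_sketch(const small_vector_sketch&) = delete;
        ~small_vector_sketch() {
            if (_data != _inline) { std::free(_data); }
        }
        void push_back(const T& v) {
            if (_size == _capacity) {
                // Outgrew the inline buffer: spill to the heap.
                auto cap = _capacity * 2;
                T* p = static_cast<T*>(std::malloc(cap * sizeof(T)));
                std::memcpy(p, _data, _size * sizeof(T));
                if (_data != _inline) { std::free(_data); }
                _data = p;
                _capacity = cap;
            }
            std::memcpy(_data + _size, &v, sizeof(T));
            ++_size;
        }
        T* begin() { return _data; }
        T* end() { return _data + _size; }
        std::size_t size() const { return _size; }
    };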
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog, and
the initialization procedure waits for the commitlog to be fully flushed
and the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares given to
memtable flushes decreases as memory used decreases, and that can cause
the startup procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times, and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot, flushes will still
be fighting for resources with compactions, and startup will take longer
than it could.
By making sure that flushes are at this point running alone, we improve
the user experience by making sure the startup is as fast as it can be.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them are running concurrently
(thrashing).
* May deadlock if enough of them are running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini series and send them separately, if the
maintainers so prefer, although considering that this series will have
to be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
This API provides a way for the multishard reader to pause inactive shard
readers and later resume them when they are needed again. This allows
for these paused shard readers to be evicted when the node is under
pressure.
How the readers are made evictable while paused is up to the clients.
Using this API in the `multishard_combining_reader` and implementing it
in the clients will be done in the next patches.
Provide default implementation for the new virtual methods to facilitate
gradual adoption.
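The shape of the additions, as a sketch (the real signatures differ in
detail; the defaults are what make gradual adoption possible):

    #include <optional>

    struct reader {}; // stand-in for the real reader type

    struct reader_lifecycle_policy {
        // Hand over an inactive reader; while paused it may be evicted.
        virtual void pause(reader r) { /* default: just drop it */ }
        // Get the paused reader back; disengaged if it was evicted (or,
        // as in this default, if pausing isn't implemented).
        virtual std::optional<reader> try_resume() { return std::nullopt; }
        virtual ~reader_lifecycle_policy() = default;
    };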
It doesn't make sense for the multishard reader anyway, as it's only
used by the row-cache. We are about to introduce the pausing of inactive
shard readers, and it would require complex data structures and code
to maintain support for this feature that is not even used. So drop it.
As we are about to add multiple sources of evictable readers, we need a
more scalable solution than a single functor being passed that opaquely
evicts a reader when called.
Add a generic way to register and unregister evictable (inactive)
readers to the semaphore. The readers are expected to be registered when
they become evictable and are expected to be unregistered when they
cease to be evictable. The semaphore might evict any reader that is
registered to it, when it sees fit.
This also solves the problem of notifying the semaphore when new readers
become evictable. Previously there was no such mechanism, and the
semaphore would only evict any such new readers when a new permit was
requested from it.
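A sketch of the registration API described above (names follow this
description, not necessarily the final code):

    #include <memory>

    struct inactive_read {
        // Called by the semaphore when it decides to evict this read.
        virtual void evict() = 0;
        virtual ~inactive_read() = default;
    };

    class reader_concurrency_semaphore {
    public:
        class inactive_read_handle; // opaque ticket for unregistering
        // Register when the read becomes evictable; the semaphore may
        // call evict() on it at any time afterwards.
        inactive_read_handle register_inactive_read(std::unique_ptr<inactive_read>);
        // Unregister when the read is about to be used again; returns
        // the read, or nullptr if it was already evicted.
        std::unique_ptr<inactive_read> unregister_inactive_read(inactive_read_handle);
    };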
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.
TODO: add `namespace fs = std::experimental::filesystem;` and
use fs::path directly, rather than lister::path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>