"
With this patches a monitor is destroyed before the writer, which
simplifies the writer destructor.
"
* 'espindola/simplify-write-monitor-v2' of https://github.com/espindola/scylla:
sstables: Delete write_failed
sstables: Move monitor after writer in compaction_writer
We need only one of the shards owning each ssatble to call create_links.
This will allow us to simplify it and only handle crash/replay scenarios rather than rename/link/remove races.
Fixes#1622
Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-3-bhalevy@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kind of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
A variant of `make_range_sstable_reader()` that wraps the reader in a
restricting reader, hence making it wait for admission on the read
concurrency semaphore, before starting to actually read.
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.
Beside changing the definition type and the insertion method, the API
was changed to support the new metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the row locking latencies to use
time_estimated_histogram.
The change consist of changing the histogram definition and changing how
values are inserted to the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
get_shards_for_this_sstable() can be called inside table::add_sstable()
because the shards for a sstable is precomputed and so completely
exception safe. We want a central point for checking that table will
no longer added shared SSTables to its sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
no longer need to conditionally track the SSTable metadata,
as table will no longer accept shared SSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table will no longer accept shared SSTables, it no longer
needs to keep track of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now that table no longer accept shared SSTables, those two functions can
be simplified by removing the shared condition.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With off-strategy work on reshard on boot and refresh, table no
longer needs to work with Shared SSTables. That will unlock
a host of cleanups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now when the snapshot stopping is correctly handled, we may pull the database
reference all the way down to the schema::describe().
One tricky place is in table::napshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks in the place
where it is needed the db is still "local"
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.
Remove this code.
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.
Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
Uploading of SSTables is problematic: for historical reasons it takes a
lock that may have to wait for ongoing compactions to finish, then it
disables writes in the table, and then it goes loading SSTables as if it
knew nothing about them.
With the sstable_directory infrastructure we can do much better:
* we can reshard and reshape the SSTables in place, keeping the number
of SSTables in check. Because this is an background process we can be
fairly aggressive and set the reshape mode to strict.
* we can then move the SSTables directly into the main directory.
Because we know they are few in number we can call the more elegant
add_sstable_and_invalidate_cache instead of the open coding currently
done by load_new_sstables
* we know they are not shared (if they were, we resharded them),
simplifying the load process even further.
The major changes after this patch is applied is that all compactions
(resharding and reshape) needed to make the SSTables in-strategy are
done in the streaming class, which reduces the impact of this operation
on the node. When the SSTables are loaded, subsequent reads will not
suffer as we will not be adding shared SSTables in potential high
numbers, nor will we reshard in the compaction class.
There is also no more need for a lock in the upload process so in the
fast path where users are uploading a set of SSTables from a backup this
should essentially be instantaneous. The lock, as well as the code to
disable and enable table writes is removed.
A future improvement is to bypass the staging directory too, in which
case the reshaping compaction would already generate the view updates.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we are scanning an sstable directory, we want to filter out the
manifest file in most situations. The table class has a filter for that,
but it is a static filter that doesn't depend on table for anything. We
are better off removing it and putting in another independent location.
While it seems wasteful to use a new header just for that, this header
will soon be populated with the sstable_directory class.
Tests: unit (dev)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
The new seastar api changes make_file_output_stream and
make_file_data_sink to return futures. This series includes a few
refactoring patches and the actual transition.
"
* 'espindola/api-v3-v3' of https://github.com/espindola/scylla:
table: Fix indentation
everywhere: Move to seastar api level 3
sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream
sstables: Pass a data_sink to checksummed_file_writer's constructor
sstables: Convert a file_writer constructor to a static make
sstables: Move file_writer constructor out of line
After 7f1a215, a sstable is only added to backlog tracker if
sstable::shared() returns true.
sstable::shared() can return true for a sstable that is actually owned
by more than one shard, but it can also incorrectly return true for
a sstable which wasn't made explicitly unshared through set_unshared().
A recent work of mine is getting rid of set_unshared() because a
sstable has the knowledge to determine whether or not it's shared.
The problem starts with streaming sstable which hasn't set_unshared()
called for it, so it won't be added to backlog tracker, but it can
be eventually removed from the tracker when that sstable is compacted.
Also, it could happen that a shared sstable, which was resharded, will
be removed from the tracker even though it wasn't previously added.
When those problems happen, backlog tracker will have an incorrect
account of total bytes, which leads it to producing incorrect
backlogs that can potentially go negative.
These problems are fixed by making every add / removal go through
functions which take into account sstable::shared().
Fixes#6227.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-2-raphaelsc@scylladb.com>
New SStables are only added to backlog tracker if set_unshared() was
called on their behalf. SStables created for streaming are not being
added to the tracker because make_streaming_sstable_for_write()
doesn't call set_unshared() nor does it caller. Which results in backlog
not accounting for their existence, which means backlog will be much
lower than expected.
This problem could be fixed by adding a set_unshared() call but it
turns out we don't even need set_unshared() anymore. It was introduced
when Scylla metadata didn't exist, now a SSTable has built-in knowledge
of whether or not it's shared. Relying on every SSTable creator calling
set_unshared() is bug prone. Let's get rid of it and let the SStable
itself say whether or not it's shared. If an imported SSTable has not
Scylla metadata, Scylla will still be able to compute shards using
token range metadata.
Refs #6021.
Refs #6227.
Fixes#6441.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.
This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore begin selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.
The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.
Fixes: #5919
Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
querier_cache: get semaphore from querier
reader_permit: forbid empty permits
reader_permit: fix reader_resources::operator bool
treewide: remove all uses of no_reader_permit()
database: make_multishard_streaming_reader: pass valid permit to multi range reader
sstables: pass valid permits to all internal reads
compaction: pass a valid permit to sstable reads
database: add compaction read concurrency semaphore
view: use valid permits for reads from the base table
database: use valid permit for counter read-before-write
database: introduce make_query_class_config()
reader_concurrency_semaphore: remove wait_admission and consume_resources()
test: move away from reader_concurrency_semaphore::wait_admission()
reader_permit: resource_units: introduce add()
mutation_reader: restricted_reader: work in terms of reader_permit
row_cache: pass a valid permit to underlying read
memtable: pass a valid permit to the delegate reader
table: require a valid permit to be passed to most read methods
multishard_mutation_query: pass a valid permit to shard mutation sources
querier: add reader_permit parameter and forward it to the mutation_source
...
Until now, view updates were generated with a bunch of random
time points, because the interface was not adjusted for passing
a single time point. The time points were used to determine
whether cells were alive (e.g. because of TTL), so it's better
to unify the process:
1. when generating view updates from user writes, a single time point
is used for the whole operation
2. when generating view updates via the view building process,
a single time point is used for each build step
NOTE: I don't see any reliable and deterministic way of writing
test scenarios which trigger problems with the old code.
After #6488 is resolved and error injection is integrated
into view.cc, tests can be added.
Fixes#6429
Tests: unit(dev)
Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>
Remove `no_reader_permit()` and all ways to create empty (invalid)
permits. All permits are guaranteed to be valid now and are only
obtainable from a semaphore.
`reader_permit::semaphore()` now returns a reference, as it is
guaranteed to always have a valid semaphore reference.
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
We want to move away from the current practice of selecting the relevant
read concurrency semaphore inside `table` and instead want to pass it
down from `database` so that we can pass down a semaphore that is
appropriate for the class of the query. Use the recently created
`query_class_config` struct for this. This is added as a parameter to
`data_query`, `mutation_query` and propagated down to the point where we
create the `querier` to execute the read. We are already propagating
down a parameter down the same route -- max_memory_reverse_query --
which also happens to be part of `query_class_config`, so simply replace
this parameter with a `query_class_config` one. As the lower layers are
not prepared for a semaphore passed from above, make sure this semaphore
is the same that is selected inside `table`. After the lower layers are
prepared for a semaphore arriving from above, we will switch it to be
the appropriate one for the class of the query.
Mutation sources will soon require a valid permit so make sure we have
one and pass it to the mutation sources when creating the underlying
readers.
For now, pass no_reader_permit() on call sites, deferring the obtaining
of a valid permit to later patches.
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.