Split the update_schema_version_and_announce() into
update_schema_version() and announce_schema_version(). This is going to
be used in storage_service::prepare_to_join() where we want to first
update the schema version, start gossip, announce the schema version.
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be an order we can stop the existing services in
cql_test_env that makes this possible.
This patch then adds another step when shutting down a database: first
stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
there is a sane way to figure out by looking at the data which one is in
use.
For safety, we had made the decision long ago to not import any
tables with counters: if a counter was generated in older Cassandra, we
would misrepresent them.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old. many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is either better, faster, or worse: the only option left. I
argue that having a flag that allow us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
"
This series prevents view building to fall back to storing hints.
Instead, it will try to send hints to an endpoint as if it has
consistency level ONE, and in case of failure retry the whole
building step. Then, view building will never be marked as finished
prematurely (because of pending hints), which will help avoid
creating inconsistencies when decommissioning a node from the cluster.
Tests:
unit (release)
dtest (materialized_views_test.py.*)
Fixes#3857Fixes#4039
"
* 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla:
db,view: add updating view_building_paused statistics
database: add view_building_paused metrics
table: make populate_views not allow hints
db,view: add allow_hints parameter to mutate_MV
storage_proxy: add allow_hints parameter to send_to_endpoint
This commit declares shared_ptr<user_types_metadata> in
database.hh were user_types_metadata is an incomplete type so
it requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently nop_large_partition_handler is only used in tests, but it
can also be used avoid self-reporting.
Tests: unit(Release)
I also tested starting scylla with
--compaction-large-partition-warning-threshold-mb=0.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190123205059.39573-1-espindola@scylladb.com>
Recently we had a bug (#4096) due to a component
(`multishard_mutation_query()`) assuming that all reads used the
semaphore obtainable via `database::user_read_concurrency_sem()`.
This problem revealed that it is plain wrong to allow access to the
shard-global semaphores residing in the database object. Instead all
code wishing to access the relevant semaphore for some read, should do
so via the relevant `table` object, thus guaranteeing that it will get
the correct semaphore, configured for that table.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f3a6780eb3240822db34aba7c1ba0a675a96592.1547734212.git.bdenes@scylladb.com>
Many area of the code are splattered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.
text data bss dec hex filename
85160690 42120 284910 85487720 5187068 scylla.before
84824762 42120 284910 85151792 5135030 scylla.after
* https://github.com/avikivity/scylla detemplate/v1:
api/commitlog: de-template acquire_cl_metric()
database: de-template do_parse_schema_tables
database: merge for_all_partitions and for_all_partitions_slow
hints: de-template scan_for_hints_dirs()
schema_tables: partially de-template make_map_mutation()
distributed_loader: de-template
tests: commitlog_test: de-template
tests: cql_auth_query_test: de-template
test: de-template eventually() and eventually_true()
tests: flush_queue_test: de-template
hint_test: de-template
tests: mutation_fragment_test: de-template
test: mutation_test: de-template
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this lead to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.
Fixes: #4096
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
Do not allow write access to the sstable list via this accessor. Luckily
there are no violations, and now we enforce it.
Message-Id: <20190111151049.16953-1-avi@scylladb.com>
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.
Fixes#4051.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
Implementation of nodetool toppartiotion query, which samples most frequest PKs in read/write
operation over a period of time.
Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
interfaces with data_listeners and the REST api),
- REST api for toppartitions query.
Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).
What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).
Fixes#2811
* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
top_k: whitespace and minor fixes
top_k: map template arguments
top_k: std::list -> chunked_vector
top_k: support for appending top_k results
nodetool toppartitions: refactor table::config constructor
nodetool toppartitions: data listeners
nodetool toppartitions: add data_listeners to database/table
nodetool toppartitions: fully_qualified_cf_name
nodetool toppartitions: Toppartitions query implementation
nodetool toppartitions: Toppartitions query REST API
nodetool toppartitions: nodetool-toppartitions script
Add data_listeners member to database.
Adds data_listeners* to table::config, to be used by table methods to invoke listeners.
Install on_read() listener in table::make_reader().
Install on_write() listener in database::apply_in_memory().
Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
Expose the base replica's current memory view update backlog, which is
defined in terms of units consumed from the semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building sends view updates synchronously, which has natural
backpressure. However, they
1) Contribute to the load on the view replicas, and;
2) Add memory pressure to the base replica.
They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We no longer wait on the semaphore and instead over-subscribe it, so
there's not reason to pass a timeout.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode, we fail to throttle them and consequently should
start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, or failing the base replica
write. The latter can fail the whole user write, or if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).
Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).
100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.
To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.
To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded, but also to allow it to
absorb bursts of writes resulting in large view mutations.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them is running concurrently
(trashing).
* May deadlock if enough of them is running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrecy_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini series and send them separately, if the
maintainers so prefer, altough considering that this series will have to
be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.
But the SSTables are not commited at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes#3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
This method can be used to check if sstable is staging,
i.e. it shouldn't be compacted and it will not be used
for generating view updates from other staging tables,
and return proper shared_sstable pointer if it is.
When generating view updates from a staging sstable, this sstable
should not be used in the process. Hence, a reader that skips a single
sstable is added.
After materialized view updates are generated, the sstable
should be moved from staging/ to a regular directory.
It's expected to be called from seastar::async thread context.
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.
Also add a unit test, that is based on the real-life purpose the
multishard streaming reader was designed for - serving partition
from a shard, according to a sharding configuration that is different
than the local one. This is also the scenario that found the buf in the
first place.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
SStable format mc doesn't write ancestors to metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
Fix is about not relying on ancestors metadata for this operation.
Fixes#3777.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>