This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.
The rough idea is that when the admission queue in the CQL server fills up, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads care most about the success rate of as many requests as possible, and hanging on a semaphore reduces a request's chances of succeeding. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish.
It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.
Closes #8680
* github.com:scylladb/scylla:
test: add a case for conflicting workload types
cql-pytest: add basic tests for service level workload types
docs: describe workload types for service levels
sys_dist_ks: fix redundant parsing in get_service_level
sys_dist_ks: make get_service_level exception-safe
transport: start shedding requests during potential overload
client_state: hook workload type from service levels
cql3: add listing service level workload type
cql3: add persisting service level workload type
qos: add workload_type service level parameter
It is forbidden to create a secondary index on a column whose type includes
the "duration" type in any way - including a UDT which contains a duration.
Our code attempted to print in this case the message "Secondary indexes
are not supported on UDTs containing durations" - but because we tested
for tuples first, and UDTs are also tuples - we got the message about
tuples.
By changing the order of the tests, we get the most specific (and
useful) error message.
Fixes #8724.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526201042.642550-1-nyh@scylladb.com>
The routine used for getting service level information already
operates on the service level name, but the same information is
also parsed once more from a row of an internal table.
This parsing is redundant, so it's hereby removed.
The purpose of the class in question is to start the sharded storage
service to make its global instance alive. I don't know when exactly
it happened, but no code that instantiates this wrapper really needs
the global storage service.
Ref: #2795
tests: unit(dev), perf_sstable(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
This commit implements the following overload prevention heuristics:
if the admission queue becomes full, a timer is armed for 50ms.
If any of the ongoing requests finishes, the timer is disarmed,
but if that doesn't happen, the server goes into shedding mode,
which means that it reads new requests from the socket and immediately
drops them until one of the ongoing requests finishes.
This heuristic is not recommended for OLAP workloads,
so it is applied only if the session declared itself as
interactive (via service level's workload_type parameter).
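The state machine described above can be sketched as follows; only the 50ms grace period comes from the text, the class and its member names are hypothetical:

```cpp
#include <cassert>
#include <optional>

// Toy sketch of the shedding heuristic: when the admission queue fills up,
// a grace timer is armed; if no ongoing request finishes before it fires,
// the server switches to shedding mode and drops new requests until a
// request completes. Times are plain millisecond counters for simplicity.
class shedding_gate {
    std::optional<long> _deadline; // armed when the queue first becomes full
    bool _shedding = false;
public:
    // The admission queue just became full: arm the 50ms grace timer.
    void on_queue_full(long now_ms) {
        if (!_deadline) {
            _deadline = now_ms + 50;
        }
    }
    // An ongoing request finished: disarm the timer and stop shedding.
    void on_request_finished() {
        _deadline.reset();
        _shedding = false;
    }
    // Timer tick: if the grace period expired without any completion, shed.
    void tick(long now_ms) {
        if (_deadline && now_ms >= *_deadline) {
            _shedding = true;
        }
    }
    bool should_shed() const { return _shedding; }
};
```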
The workload type is currently one of three values:
- unspecified
- interactive
- batch
By defining the workload type, the service level makes it easier
for other components to decide what to do in overload scenarios.
E.g. if the workload is interactive, requests can be shed earlier,
while if it's batch (or unspecified), shedding does not take place.
Conversely, batch workloads could accept long full scan operations.
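The three-valued type and the overload policy it drives can be sketched like this; the enum values are taken from the text, while the helper function is an assumed name for illustration:

```cpp
#include <cassert>

// The three workload types a service level can declare, per the text above.
enum class workload_type { unspecified, interactive, batch };

// Hypothetical policy helper: only interactive sessions opt in to early
// shedding; batch and unspecified workloads keep queueing, so long-running
// operations (e.g. full scans) are not dropped.
inline bool may_shed_on_overload(workload_type w) {
    return w == workload_type::interactive;
}
```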
Some subclasses want to maintain state, which constness needlessly precludes.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #8721
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.
The checking code was correct, but in the wrong place: we need to first
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do the new error checking if this is not the case.
This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
cassandra_tests/validation/entities/secondary_index_test.py::
testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.
Fixes #8717.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
We have enabled off-strategy compaction for bootstrap, replace,
decommission and removenode operations when repair based node operation
is enabled. Unlike node operations like replace or decommission, it is
harder to know when the repair of a table is finished because users can
send multiple repair requests one after another, each request repairing
a few token ranges.
This patch wires off-strategy compaction for regular repair by adding
a timeout based automatic off-strategy compaction trigger mechanism.
If there is no repair activity for some time, off-strategy compaction
will be triggered for that table automatically.
Fixes #8677
Closes #8678
This warning triggers when a range-for loop ("for (auto x : range)") causes
non-trivial copies, prompting the developer to iterate by reference
instead. A few minor violations in the test suite are corrected.
Closes #8699
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations is not enabled by default, we should
respect the flag to use stream to sync data if the flag is false.
Fixes #8700
Closes #8701
* github.com:scylladb/scylla:
storage_service: Add removenode_add_ranges helper
storage_service: Respect --enable-repair-based-node-ops flag during removenode
Fixes #8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. we need to flush everything).
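The adjusted threshold check can be sketched as follows; the struct and helper are hypothetical illustrations of the two points above (count wasted slack as consumed, and trigger the flush one segment before the limit), not Scylla's actual commitlog code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of commitlog space accounting.
struct commitlog_usage {
    uint64_t used;         // bytes of live data in segments
    uint64_t wasted;       // slack left in closed segments (can't be reused yet)
    uint64_t segment_size; // size of a single segment
    uint64_t max_size;     // total on-disk budget
};

// Start flushing once the *footprint* (used + wasted) reaches the last
// available segment, so a new segment is hopefully free by the time the
// hard limit is hit.
inline bool should_flush(const commitlog_usage& u) {
    uint64_t footprint = u.used + u.wasted;
    uint64_t threshold =
        u.max_size > u.segment_size ? u.max_size - u.segment_size : 0;
    return footprint >= threshold;
}
```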
Closes #8695
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
Refs #8270
Since segment allocation looks at actual disk footprint, not active,
the threshold check in timer task should include slack space so we
don't mistake sparse usage for space left.
Refs #8270
Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "halfway through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle on heavy usage.
Currently the pending (memtable) flush stats are adjusted back
only on success, therefore they will "leak" on error, so use
a .then_wrapped clause to always update the stats.
Note that _commitlog->discard_completed_segments is still called
only on success, and so is returning the previous_flush future.
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-2-bhalevy@scylladb.com>
Currently, advance_and_wait() allocates a new gate
which might fail. Rather than returning this failure
as an exceptional future - which will require its callers
to handle that failure, keep the function as noexcept and
let an exception from make_lw_shared<gate>() terminate the program.
This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.
The alternative of making the allocation of a new gate optional
and recovering from it in start() is possible, but was deemed not
worth it as it would add complexity and cost to start(), which is
called on the common, hot path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occasionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.
Ref #8613
Closes #8692
In https://github.com/scylladb/scylla/issues/8609,
table::stop() that is called from database::drop_column_family
is expected to wait on outstanding flushes by calling
_memtable->request_flush(), but the memtable_list is considered
empty() at this point as it has a single empty memtable,
so request_flush() returns a ready future, without waiting
on outstanding flushes. This change replaces the call to
request_flush with flush().
Fix that by either returning the _flush_coalescing future
that resolves when the memtable is sealed, if available,
or by going through the get_flush_permit() and
_dirty_memory_manager->flush_one() song and dance, even though
the memtable is empty(), as the latter waits on pending flushes.
Fixes #8609
Test: unit(dev)
DTest: alternator_tests.py:AlternatorTest.test_batch_with_auto_snapshot_false(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524143438.1056014-1-bhalevy@scylladb.com>
The algorithm used in `get_address_ranges` and `get_range_addresses`
calls `calculate_natural_endpoints` in a loop; the loop iterates over
all tokens in the token ring. If the complexity of a particular
implementation of `calculate_natural_endpoints` is large - say `θ(n)`,
where `n` is the number of tokens - this results in an `θ(n^2)`
algorithm (or worse). This is the case for the `Everywhere` replication strategy.
For small clusters this doesn't matter that much, but if `n` is, say, `20*255`,
this may result in huge reactor stalls, as observed in practice.
We avoid these stalls by inserting tactical yields. We hope that
some day someone actually implements a subquadratic algorithm here.
The commit also adds a comment on
`abstract_replication_strategy::calculate_natural_endpoints` explaining
that the interface does not give a complexity guarantee (at this point);
the different implementations have different complexities.
For example, `Everywhere` implementation always iterates over all tokens
in the token ring, so it has `θ(n)` worst and best case complexity.
On the other hand, `NetworkTopologyStrategy` implementation usually
finishes after visiting a small part of the token ring (specifically,
as soon as it finds a token for each node in the ring) and performs
a constant number of operations for each visited token on average,
but theoretically its worst case complexity is actually `O(n + k^2)`,
where `n` is the number of all tokens and `k` is the number of endpoints
(the `k^2` appears since for each endpoint we must perform finds and
inserts on `unordered_set` of size `O(k)`; `unordered_set` operations
have `O(1)` average complexity but `O(size of the set)` worst case
complexity).
Therefore it's not easy to put any complexity guarantee in the interface
at this point. Instead, we say that:
- some implementations may yield - if their complexities force us to do so
- but in general, there is no guarantee that an implementation
yields - e.g. the `Everywhere` implementation does not yield.
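The tactical-yield pattern described above can be sketched as follows; the helper, the callback standing in for the reactor's yield primitive, and the every-256-iterations cadence are all illustrative assumptions, not Scylla's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Iterate over a long token ring, handing control back periodically so a
// single invocation never stalls the reactor for the whole O(n^2) work.
// Returns the number of times we yielded (useful for testing the sketch).
template <typename T, typename Fn>
size_t for_each_with_yield(const std::vector<T>& ring, Fn&& body,
                           const std::function<void()>& yield,
                           size_t yield_every = 256) {
    size_t yields = 0;
    for (size_t i = 0; i < ring.size(); ++i) {
        body(ring[i]);
        if ((i + 1) % yield_every == 0) {
            yield(); // tactical yield: let other tasks run
            ++yields;
        }
    }
    return yields;
}
```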
Fixes #8555.
Closes #8647
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
Fixes #8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
In commit 829b4c1 (repair: Make removenode safe by default), removenode
was changed to use repair based node operations unconditionally. Since
repair based node operations is not enabled by default, we should
respect the flag to use stream to sync data if the flag is false.
Fixes #8700
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.
Also remove the matching "extern template" declaration, which has no
definition any more.
Closes #8696
* seastar 28dddd2683...f0f28d07e1 (4):
> httpd: allow handler to not read an empty content
Fixes #8691.
> compat: source_location: implement if no std or experimental are available
> compat: source_location: declare using in seastar::compat namespace
> perftune.py: fix a bug in mlx4 IRQs names matching pattern
To keep our cql-pytest tests "correct", we should strive for them to pass on
Cassandra - unless they are testing a Scylla-only feature or a deliberate
difference between Scylla and Cassandra - in which case they should be marked
"scylla-only" and cause such tests to be skipped when running on Cassandra.
The following few small patches fix a few cases where our tests were failing on
Cassandra. In one case this even found a bug in the test (a trivial Python
mistake, but still).
Closes #8694
* github.com:scylladb/scylla:
test/cql-pytest: fix python mistake in an xfailing test
test/cql-pytest: mark some tests with scylla-only
test/cql-pytest: clean up test_create_large_static_cells_and_rows
The cql3 layer manipulates lists as `std::vector`s (of `managed_bytes_opt`). Since lists can be arbitrarily large, let's use chunked vectors there to prevent potentially large contiguous allocations.
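The allocation strategy a chunked vector relies on can be illustrated with a toy sketch; this is not the interface of Scylla's `utils::chunked_vector`, just the idea of replacing one large contiguous allocation with many bounded ones:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy chunked vector: elements live in fixed-size chunks, so growing never
// requires a single huge contiguous allocation - each allocation is bounded
// by ChunkSize elements regardless of the total number of elements stored.
template <typename T, size_t ChunkSize = 1024>
class toy_chunked_vector {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    size_t _size = 0;
public:
    void push_back(T v) {
        if (_size % ChunkSize == 0) {
            // Current chunk is full (or none exists yet): start a new one.
            auto c = std::make_unique<std::vector<T>>();
            c->reserve(ChunkSize);
            _chunks.push_back(std::move(c));
        }
        _chunks.back()->push_back(std::move(v));
        ++_size;
    }
    T& operator[](size_t i) {
        return (*_chunks[i / ChunkSize])[i % ChunkSize];
    }
    size_t size() const { return _size; }
};
```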
Closes #8668
* github.com:scylladb/scylla:
cql3: change the internal type of tuples::in_value from std::vector to chunked_vector
cql3: change the internal type of lists::value from std::vector to chunked_vector
cql3: in multi_item_terminal, return the vector of items by value
This series fixes a minor validation issue with service level timeouts - negative values were not checked. This bug is benign because negative timeouts act just like a 0s timeout, but the original series claimed to validate against negative values, so it's hereby fixed.
More importantly however, this series follows by enabling cql-pytest to run service level tests and provides a first batch of them, including a missing test case for negative timeouts.
The idea is similar to what we already have in alternator test suite - authentication is unconditionally enabled, which doesn't affect any existing tests, but at the same time allows writing test cases which rely on authentication - e.g. service levels.
Closes #8645
* github.com:scylladb/scylla:
cql-pytest: introduce service level test suite
cql-pytest: add enabling authentication by default
qos: fix validating service level timeouts for negative values
Now the RPC module has some basic test coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.
Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.
This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.
It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.
* manmanson/raft-rpc-tests-v9-alt3:
raft: add tests for RPC module
test: add CHECK_EVENTUALLY_EQUAL utility macro
raft: replication_test: reset test rpc network between test runs
raft: replication_test: extract tickers initialization into a separate func
raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
raft: replication_test: introduce `test_server` aggregate struct
raft: replication_test: support voter<->learner configuration changes
raft: remove duplicate `create_command` function from `replication_test`
raft: avoid 'using' statements in raft testing helpers header
This patchset adds the missing features noted by the patchset
introducing it, namely:
* The ability to run a test through `coverage.py`, automating the entire
process of setting up the environment, running the test and generating
the report. This is possible with the new `--run` command line
argument. It supports either generating a report immediately after
running the provided test or just doing the running part, allowing the
user to generate the report after having run all the tests they wanted
to.
* A tweakable verbosity level.
It is also possible to specify a subset of the profiling data as input
for the report.
The documentation was also completed, with examples for all the
intended use cases.
With these changes, `coverage.py` is considered mature, the remaining
rough edges being located in other scripts (`tests.py` and
`configure.py`).
It is now possible to generate a coverage report for any test desired.
Also on: https://github.com/denesb/scylla.git
coverage-py-missing-features/v1
Botond Dénes (5):
scripts/coverage.py: allow specifying the input files to generate the
report from
scripts/coverage.py: add capability of running a test directly
scripts/coverage.py: add --verbose parameter
scripts/coverage.py: document intended uses-cases
HACKING.md: redirect to ./coverage.py for more details
scripts/coverage.py | 143 +++++++++++++++++++++++++++++++++++++++-----
HACKING.md | 19 +-----
2 files changed, 129 insertions(+), 33 deletions(-)
"
The patch set is an assorted collection of header cleanups, e.g.:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
This is a follow up change to #8512.
Let's add the aio conf file during the Scylla installation process and
make sure we also remove this file when uninstalling Scylla.
As per Avi Kivity's suggestion, let's set the aio value as static
configuration, and make it large enough to work with 500 CPUs.
Closes #8650
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We might be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait
for the device to become available, but the systemd manual says it
automatically configures this dependency for a mount unit when we specify
the filesystem path as the "absolute path of a device node".
So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.
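For illustration, the change to the mount unit would look roughly like this; the `<uuid>` placeholder stays unfilled, and the `Where=` path is an assumption based on the unit name:

```ini
# Before: systemd cannot infer a device dependency from the UUID= form.
# [Mount]
# What=UUID=<uuid>

# After: an absolute device-node path lets systemd add the implicit
# dependency on the backing device unit automatically.
[Mount]
What=/dev/disk/by-uuid/<uuid>
Where=/var/lib/scylla
```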
Fixes #8279
Closes #8681
The xfailing test cassandra_tests/validation/entities/collections_test.py::
testSelectionOfEmptyCollections had a Python mistake (using {} instead
of set() for an empty set), which resulted in its failure when run
against Cassandra. After this patch it passes on Cassandra and fails on
Scylla - as expected (this is why it is marked xfail).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Tests which are known to test a Scylla-only feature (such as CDC)
or to rely on a known and deliberate difference between Scylla and Cassandra
should be marked "scylla-only", so they are skipped when running
the tests against Cassandra (test/cql-pytest/run-cassandra) instead
of reporting errors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_create_large_static_cells_and_rows had its own
implementation of "nodetool flush" using Scylla's REST API.
Now that we have a nodetool.flush() function for general use in
cql-pytest, let's use it and save a bit of duplication.
Another benefit is that now this test can be run (and pass) against
Cassandra.
To allow this test to run on Cassandra, I had to remove a
"USING TIMEOUT" which wasn't necessary for this test, and is
not a feature supported by Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>