We don't explicitly clean up the memtable, although
it might hold tokens disowned by the current node.
Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.
Note that non-owned ranges are invalidated in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.
Fixes #1239
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit eb3a94e2bc)
Move the integration with compaction_manager
from the api layer to the table class so
the next patch can also make sure the memtable is cleaned up.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit fc278be6c4)
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.
Fix this by only applying the -O1 to debug modes.
Fixes #12463
Closes #12460
(cherry picked from commit 08b3a9c786)
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: to detect when build_some() is done, it is no
longer enough to check whether it returned zero view-update rows; rather,
build_some() now explicitly returns whether or not it is done as an std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
Fixes #12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12305
(cherry picked from commit 92d03be37b)
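The termination bug above can be sketched in plain C++. This is a toy model, not the actual Scylla code: the row representation, the odd/even stand-in for can_skip_view_updates(), and the explicit `done` flag (the real fix uses an std::optional) are all illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One batch may legitimately produce zero view updates when every row in
// it is skipped, so "empty result" must not be treated as "done".
// Completion is reported explicitly instead.
struct batch_result {
    std::vector<int> updates;   // generated view updates
    bool done;                  // explicit end-of-input signal
};

batch_result build_some(const std::vector<int>& rows, std::size_t& pos,
                        std::size_t batch_size) {
    batch_result r{{}, false};
    std::size_t end = std::min(pos + batch_size, rows.size());
    for (; pos < end; ++pos) {
        if (rows[pos] % 2 == 0) {       // stand-in for can_skip_view_updates()
            continue;                   // skipped: no update generated
        }
        r.updates.push_back(rows[pos]);
    }
    r.done = (pos == rows.size());      // done only when input is exhausted
    return r;
}

std::size_t count_all_updates(const std::vector<int>& rows, std::size_t batch_size) {
    std::size_t pos = 0, total = 0;
    while (true) {
        auto r = build_some(rows, pos, batch_size);
        total += r.updates.size();
        if (r.done) {                   // NOT: if (r.updates.empty())
            return total;
        }
    }
}
```

With the buggy "stop on empty batch" check, a first batch of all-skipped rows would terminate the loop early and lose the later updates.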
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to ensure only a single eviction fiber is running at any given time. This mechanism however had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for new waiters or inactive readers to be added without the fiber acting on them. Since the fiber still held the lock, it also prevented other eviction fibers from starting. This could create a situation where the semaphore could admit new reads by evicting inactive ones, yet it still had waiters. Since an empty waitlist is also an admission criterion, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds detection of the inactive/waiter contradiction.
Fixes: #11923
Refs: #11770
Closes #12026
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
(cherry picked from commit 15ee8cfc05)
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods, which admit permits on user
request (the front door), and maybe_admit_waiters(), which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged in time. Notably,
maybe_admit_waiters() did not look at inactive readers, assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls (used permits == blocked permits) were, so we
could evict readers even when it couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.
Fixes: #11770
Closes #11784
(cherry picked from commit 7fbad8de87)
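The shared admission predicate and the recursion guard described above can be sketched as a toy model in plain C++. The names (semaphore_state, can_admit_read, try_admit_one, admits) and the counter-based resource model are illustrative assumptions, not the actual Scylla implementation.

```cpp
// Toy model of one admission predicate shared by both admission paths,
// plus a flag guarding against evict -> maybe_admit -> evict recursion.
struct semaphore_state {
    int available;      // free resource units
    int inactive;       // evictable inactive readers
    bool evicting;      // set while an eviction is in progress
};

enum class admit { yes, no, evict_first };

// Used by the front door (obtain_permit()) and the back door
// (maybe_admit_waiters()) alike, so the two paths cannot diverge.
admit can_admit_read(const semaphore_state& s) {
    if (s.available > 0) return admit::yes;
    if (s.inactive > 0) return admit::evict_first;  // eviction might help
    return admit::no;
}

bool try_admit_one(semaphore_state& s) {
    switch (can_admit_read(s)) {
    case admit::yes:
        --s.available;
        return true;
    case admit::evict_first:
        if (s.evicting) return false;   // already evicting: don't recurse
        s.evicting = true;
        --s.inactive;                   // "evict" one inactive reader...
        ++s.available;                  // ...reclaiming its resources
        s.evicting = false;
        return try_admit_one(s);        // retry with the freed resources
    case admit::no:
        return false;
    }
    return false;
}

// Convenience wrapper: can a read be admitted from this starting state?
bool admits(int available, int inactive) {
    semaphore_state s{available, inactive, false};
    return try_admit_one(s);
}
```

Eviction is only attempted when it can actually lead to an admission, mirroring the tuning described in the commit.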
The --online-discard option is defined as a string parameter since it doesn't
specify "action=", but its default value is a boolean (default=True).
This breaks "provisioning in a similar environment" since the code
assumed a boolean value implies "action='store_true'", but that's not the case.
Change the type of the option to int, and also specify
"choices=[0, 1]", just like --io-setup does.
Fixes #11700
Closes #11831
(cherry picked from commit acc408c976)
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.
Fixes: #12060
(cherry picked from commit 7730c4718e)
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous page could cause the new
page, and subsequently the entire query, to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.
This commit also adds a unit test which reproduces the bug.
Fixes: #12361
Closes #12384
(cherry picked from commit b0d95948e1)
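The stale-flag bug is easy to model in isolation. A minimal sketch, assuming a toy compactor with a row-count page limit (the real compactor's stop condition is more involved; `consume_row` and `_page_limit` here are hypothetical):

```cpp
// Toy model: a paging consumer caches its stop decision in a member
// flag, so start_new_page() must clear it for the next page.
struct compactor {
    int _page_limit;
    int _rows_in_page = 0;
    bool _stop = false;

    explicit compactor(int limit) : _page_limit(limit) {}

    void start_new_page() {
        _rows_in_page = 0;
        _stop = false;      // the fix: forget the previous page's decision
    }

    bool consume_row() {    // returns false when called on a stopped page
        if (_stop) return false;
        if (++_rows_in_page >= _page_limit) _stop = true;
        return true;
    }
};

// Without the reset in start_new_page(), the _stop set at the end of
// page 1 would leak into page 2 and reject its very first row.
bool second_page_accepts_rows() {
    compactor c(2);
    c.consume_row();
    c.consume_row();        // page 1 is now full: _stop is set
    c.start_new_page();
    return c.consume_row();
}
```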
This series backports several patches which add or enable tests for Alternator TTL. The series does not touch the code - just tests.
The goal of backporting more tests is to get the code - which is already in branch 5.1 - tested. It wasn't a good idea to backport code without backporting the tests for it.
Closes #12200
Fixes #11374
* github.com:scylladb/scylladb:
test/alternator: increase timeout on TTL tests
test/alternator: fix timeout in flaky test test_ttl_stats
test/alternator: test Alternator TTL metrics
test/alternator: skip fewer Alternator TTL tests
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound exists. This is a source of reactor stalls.
Fix that.
Fixes #12271
Closes #12364
(cherry picked from commit d9269abf5b)
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613
This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by the ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).
To make sure the same CQL version is used across the docs, we should use the `|cql-version|` variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```
Closes #11320
* github.com:scylladb/scylladb:
doc: add the Cassandra version on which the tools are based
doc: fix the version number
doc: update the Enterprise version where the ME format was introduced
doc: add the ME format to the Cassandra Compatibility page
doc: replace Scylla with ScyllaDB
doc: rewrite the Interfaces table to the new format to include more information about CQL support
doc: remove the CQL version from pages other than Cassandra compatibility
doc: fix the CQL version in the Interfaces table
(cherry picked from commit ee606a5d52)
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire shared lock on the partition
and any lock on the row. both cases currently use a fiber
returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a when_all_succeeds continuation in lock_ck runs. Since
the lock_partition future failed, it destroys both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling,
as seen above.
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.
Fixes #12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12208
(cherry picked from commit 5007ded2c1)
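The ordering fix can be sketched as a toy model in plain C++, without futures or fibers. The map-based lock state and the names (lock_partition_shared, lock_ck, lock_result) are illustrative assumptions, not the real row_locker API.

```cpp
#include <map>
#include <string>

// Toy model of the fix: take the partition lock first and the row lock
// only afterwards, so a failed/timed-out partition lock can never leave
// a dangling row-lock holder behind (the step 4.c crash above).
struct lock_result { bool partition_locked = false; bool row_locked = false; };

struct row_locker {
    // partition -> lock state: -1 = held exclusively, 0 = free, >0 = shared
    std::map<std::string, int> partitions;

    bool lock_partition_shared(const std::string& pk) {
        int& st = partitions[pk];
        if (st < 0) return false;       // held exclusively: "times out"
        ++st;
        return true;
    }

    lock_result lock_ck(const std::string& pk) {
        lock_result r;
        // Sequential, not when_all_succeed: if the partition lock fails,
        // the row lock was never touched, so there is nothing to unwind.
        if (!lock_partition_shared(pk)) return r;
        r.partition_locked = true;
        r.row_locked = true;            // row lock taken strictly after
        return r;
    }
};

// Reproduces step 2 of the scenario: lock_ck against an exclusively
// locked partition must not acquire the row lock at all.
bool row_lock_taken_under_exclusive_partition() {
    row_locker l;
    l.partitions["p"] = -1;             // another fiber holds it (step 1)
    return l.lock_ck("p").row_locked;
}
```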
This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide.
In addition, I have:
- updated the KB article to include the versions affected by that change.
- fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported).
Related: https://github.com/scylladb/scylladb/pull/11122
Closes #12148
* github.com:scylladb/scylladb:
doc: update the releases in the KB about updating the mode after upgrade
doc: fix the broken link in the 5.1 upgrade guide
doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide
(cherry picked from commit 897b501ba3)
This is a backport of https://github.com/scylladb/scylladb/pull/11783.
Closes #12229
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB
doc: add a comment to remove in future versions any information that refers to previous releases
doc: rewrite the notes to improve clarity
doc: remove the repetitions from the notes
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
(cherry picked from commit 83850e247a)
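The abort wiring can be sketched as a toy model in plain C++ (no Seastar). The callback-list abort_source and the string-state config_waiter are illustrative stand-ins for Seastar's abort_source and the promise field in `server`.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal abort source: subscribers are notified when abort is requested.
struct abort_source {
    bool aborted = false;
    std::vector<std::function<void()>> subscriptions;
    void subscribe(std::function<void()> cb) {
        subscriptions.push_back(std::move(cb));
    }
    void request_abort() {
        aborted = true;
        for (auto& cb : subscriptions) cb();
    }
};

// Stand-in for the promise waiting on the non-joint configuration entry.
struct config_waiter {
    std::string state = "waiting";
    void attach(abort_source& as) {
        as.subscribe([this] {
            if (state == "waiting") {
                state = "aborted";      // resolve the promise exceptionally
            }
        });
    }
    void entry_committed() {
        if (state == "waiting") state = "committed";
    }
};

// Without the subscription, an abort after losing leadership would leave
// the waiter stuck in "waiting" forever.
std::string waiter_state_after_abort() {
    abort_source as;
    config_waiter w;
    w.attach(as);
    as.request_abort();     // e.g. server removed from the configuration
    return w.state;
}
```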
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and then proceed to drop the waiters for any remaining
entries (if the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reached the waiter-dropping
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes #11235.
Closes #11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
(cherry picked from commit 9c4e32d2e2)
Contains fixes requested in the issue (and some tiny extras), together with an analysis of why they don't affect users (see commit messages).
Fixes [#11800](https://github.com/scylladb/scylladb/issues/11800)
Closes #11926
* github.com:scylladb/scylladb:
alternator: add maybe_quote to secondary indexes 'where' condition
test/alternator: correct xfail reason for test_gsi_backfill_empty_string
test/alternator: correct indentation in test_lsi_describe
alternator: fix wrong 'where' condition for GSI range key
(cherry picked from commit ce7c1a6c52)
The SELECT JSON statement, just like SELECT, allows the user to rename
selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo".
This specification was not honored: We simply forgot to look at the
alias in SELECT JSON's implementation (we did it correctly in regular
SELECT). So this patch fixes this bug.
We had two tests in cassandra_tests/validation/entities/json_test.py
that reproduced this bug. The checks in those tests now pass, but these
two tests still continue to fail after this patch because of two other
unrelated bugs that were discovered by the same tests. So in this patch
I also add a new test just for this specific issue - to serve as a
regression test.
Fixes #8078
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12123
(cherry picked from commit c5121cf273)
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
Fixes #10026
Fixes #11542
(cherry picked from commit 2f2f01b045)
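The cache-invalidation fix can be sketched as a toy model in plain C++. The schema-version integer, the string "statement", and the names used here (base_info, select_statement) are illustrative, not the actual view_info API.

```cpp
#include <memory>
#include <string>

// Toy stand-in for the base table's schema.
struct base_info { int schema_version; };

// Toy view_info: lazily builds and caches a SELECT from the base schema.
struct view_info {
    std::shared_ptr<base_info> _base;
    mutable std::shared_ptr<std::string> _cached_select;    // lazy cache

    const std::string& select_statement() const {
        if (!_cached_select) {
            _cached_select = std::make_shared<std::string>(
                "SELECT ... /* built against base schema v" +
                std::to_string(_base->schema_version) + " */");
        }
        return *_cached_select;
    }

    void set_base_info(std::shared_ptr<base_info> b) {
        _base = std::move(b);
        _cached_select = nullptr;   // the fix: forget the stale statement
    }
};

// Without the reset, the second select_statement() call would still
// return the statement built against the old schema.
bool cache_follows_schema_change() {
    view_info v;
    v.set_base_info(std::make_shared<base_info>(base_info{1}));
    auto before = v.select_statement();
    v.set_base_info(std::make_shared<base_info>(base_info{2}));
    return v.select_statement() != before;
}
```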
Some of the tests in test/alternator/test_ttl.py need an expiration scan
pass to complete and expire items. In development builds on developer
machines, this usually takes less than a second (our scanning period is
set to half a second). However, in debug builds on Jenkins each scan
often takes up to 100 (!) seconds (this is the record we've seen so far).
This is why we set the tests' timeout to 120.
But recently we saw another test run failing. I think the problem is
that in some case, we need not one, but *two* scanning passes to
complete before the timeout: It is possible that the test writes an
item right after the current scan passed it, so it doesn't get expired,
and then we a second scan at a random position, possibly making that
item we mention one of the last items to be considered - so in total
we need to wait for two scanning periods, not one, for the item to
expire.
So this patch increases the timeout from 120 seconds to 240 seconds -
more than twice the highest scanning time we ever saw (100 seconds).
Note that this timeout is just a timeout, it's not the typical test
run time: The test can finish much more quickly, as little as one
second, if items expire quickly on a fast build and machine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12106
(cherry picked from commit 6bc3075bbd)
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.
So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.
Fixes #11695
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11696
(cherry picked from commit 3a30fbd56c)
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.
We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty,
so it requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.
To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 297109f6ee)
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.
The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).
However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.
So in this patch, we remove the "veryslow" mark from 6 tests of Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.
Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5-second
time further in the following patch, by decreasing the TTL scanning
period even further.
This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being passed, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.
4 more tests remain "veryslow" (and won't run by default) because they
take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.
Fixes #11374
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 746c4bd9eb)
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour for
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values, which it
uses, explicitly returns empty_value_set if it evaluates the rhs
to null, and has a comment about it ("All NULL
comparisons fail; no column values match."). On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was that modification_statement::execute_with_condition
was not expecting an empty clustering_range in the case of
a null clustering key.
Fixes: #11954
(cherry picked from commit 0d443dfd16)
According to seastar/doc/lambda-coroutine-fiasco.md, a lambda that
co_awaits loses its capture frame once it suspends. The distributed_loader
code contains at least one lambda of that kind.
Fixes: #12175
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12170
(cherry picked from commit 71179ff5ab)
Fix https://github.com/scylladb/scylla-docs/issues/4126
Closes #11122
* github.com:scylladb/scylladb:
doc: add info about the time-consuming step due to resharding
doc: add the new KB to the toctree
doc: add a KB about updating the mode in perftune.yaml after upgrade
(cherry picked from commit e9fec761a2)
Release 5.1 introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE sections should include links to that page and section.
Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.
Related: https://github.com/scylladb/scylladb/pull/9810
Closes #12070
* github.com:scylladb/scylladb:
doc: move the info about the per-partition rate limit for the ALTER TABLE statement from the paragraph to the list
doc: add the links to the per-partition rate limit extension to the CREATE TABLE and ALTER TABLE sections
(cherry picked from commit 6e9f739f19)
This is a backport of https://github.com/scylladb/scylladb/pull/11460.
Closes #12079
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding the step to load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-only image file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z