There was a bug in `expr::search_and_replace`:
it didn't preserve the `order` field of `binary_operator`.
The `order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.
Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues:
the database could end up selecting the wrong clustering ranges.
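A minimal, self-contained sketch of the bug class (illustrative types, not the actual Scylla `expr` code): a tree-rewriting helper that rebuilds a node must copy every field, including `order`, or the marker is silently dropped.
```
#include <string>
#include <utility>

// Illustrative stand-ins for the real expression types.
enum class comparison_order { cql, clustering };    // clustering == SCYLLA_CLUSTERING_BOUND
struct expression { std::string repr; };

struct binary_operator {
    expression lhs;
    std::string op;
    expression rhs;
    comparison_order order = comparison_order::cql; // defaults to a plain CQL relation
};

// A search_and_replace-style rebuild of a node from its transformed children.
binary_operator rebuild(const binary_operator& bo, expression new_lhs, expression new_rhs) {
    // A buggy rebuild would return {std::move(new_lhs), bo.op, std::move(new_rhs)}:
    // `order` would fall back to its default and the clustering-bound marker would be lost.
    return binary_operator{std::move(new_lhs), bo.op, std::move(new_rhs), bo.order};
}
```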
Fixes: #13055
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13056
(cherry picked from commit aa604bd935)
EOF is only guaranteed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.
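A rough illustration of the idea with a plain `std::istream` (the actual code uses Seastar's streams; the names here are assumptions):
```
#include <istream>

// eof() only becomes reliable once a read has actually hit the end of the
// stream, so attempt one more read before trusting it: peek() tries to read a
// character without consuming it and sets eofbit when nothing is left, which
// is the "read yielded 0 bytes" check in stream form.
bool at_eof(std::istream& in) {
    if (in.eof()) {
        return true;
    }
    in.peek();
    return in.eof();
}
```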
Fixes: #11190
Closes #12696
(cherry picked from commit 693c22595a)
This commit makes the following changes to the docs landing page:
- Adds the ScyllaDB enterprise docs as one of three tiles.
- Modifies the three tiles to reflect the three flavors of ScyllaDB.
- Moves the "New to ScyllaDB? Start here!" under the page title.
- Renames "Our Products" to "Other Products" to list the products other
than ScyllaDB itself. In addition, the boxes are enlarged to large-4
to look better.
The major purpose of this commit is to expose the ScyllaDB
documentation.
docs: fix the link
(cherry picked from commit 27bb8c2302)
Closes #13086
This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental.
The challenge here is that OSS and Enterprise are (still) **documented together**, but they're **not in sync** in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon).
We can take one of the following approaches:
a) Merge this PR with master and ask the 2022.2 users to refer to master.
b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts.
I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting.
Closes #12295
* github.com:scylladb/scylladb:
doc: fix the version in the comment on removing the note
doc: specify the versions where Alternator TTL is no longer experimental
(cherry picked from commit d5dee43be7)
It's known that reading large cells in reverse causes large allocations.
Source: https://github.com/scylladb/scylladb/issues/11642
The loading of position metadata is preliminary work for splitting large
partitions into fragments composing a run, so that such a run can later
be read efficiently using the position metadata.
The splitting is not turned on yet, anywhere. Therefore, we can
temporarily disable the loading, as a way to avoid regressions in
stable versions. Large allocations can cause stalls due to foreground
memory eviction kicking in.
The default values for position metadata say that the first and last
position include all clustering rows. They aren't used anywhere
other than by sstable_run to determine whether a run is disjoint at
the clustering level, and given that no splitting is done yet, it
does not really matter.
Unit tests relying on position metadata were adjusted to enable
the loading, such that they can still pass.
Fixes #11642.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12979
(cherry picked from commit d73ffe7220)
Currently all consumed range tombstone changes are unconditionally
forwarded to the validator, even if they are shadowed by a higher-level
tombstone and/or purgeable. This can result in a situation where a range
tombstone change was seen by the validator but not passed to the
consumer. The validator expects the range tombstone change to be closed
by end-of-partition but the end fragment won't come as the tombstone was
dropped, resulting in a false-positive validation failure.
Fix by passing to the validator only the tombstones that are actually
passed to the consumer.
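A minimal sketch of the resulting control flow, using simplified stand-in types rather than the actual compaction code:
```
#include <utility>

struct range_tombstone_change {};       // stand-in for the real fragment type

// The validator must observe exactly the stream the consumer observes, so the
// validation hook runs only on the branch that actually forwards the change.
template <typename Validator, typename Consumer>
void forward_rtc(range_tombstone_change rtc, bool dropped,
                 Validator&& validate, Consumer&& consume) {
    if (dropped) {
        return;                         // shadowed/purgeable: neither side sees it
    }
    validate(rtc);                      // validate only what will be consumed
    consume(std::move(rtc));
}
```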
Fixes: #12575
Closes #12578
(cherry picked from commit e2c9cdb576)
Check the first fragment before dereferencing it: the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
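A simplified sketch of the pattern, using standard containers rather than the actual fragmented-buffer types:
```
#include <deque>
#include <vector>

// Skip empty fragments before dereferencing the front one, instead of assuming
// the first fragment is non-empty.
char first_byte(std::deque<std::vector<char>>& fragments) {
    while (!fragments.empty() && fragments.front().empty()) {
        fragments.pop_front();          // empty fragment: move to the next one
    }
    return fragments.empty() ? '\0' : fragments.front().front();
}
```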
Fixes: #12821
Fixes: #12823
Fixes: #12708
Closes #12824
(cherry picked from commit ef548e654d)
After 5badf20c7a the applier fiber does not
stop after it gets an abort error from the state machine, which may trigger an
assertion because the previous batch is not applied. Fix it.
Fixes #12863
(cherry picked from commit 9bdef9158e)
We should never return a reference to a local variable,
so in this change a reference to a static variable is returned
instead. This should address the following warning from Clang 17:
```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {};
^~
```
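A minimal sketch of the pattern (illustrative function, not the actual schema_loader code):
```
#include <string>

// Returning a brace-initialized temporary by const reference dangles; a
// function-local static of the same type lives for the program's lifetime,
// so returning a reference to it is safe.
const std::string& empty_name() {
    static const std::string empty{};
    return empty;
}
```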
Fixes #12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12876
(cherry picked from commit 6eab8720c4)
We currently configure only TimeoutStartSec, but that is probably not
enough to prevent the coredump timeout, since TimeoutStartSec is the maximum
waiting time for service startup, and there is another directive to
specify the maximum service running time (RuntimeMaxSec).
To fix the problem, we should specify RuntimeMaxSec and TimeoutSec (the
latter configures both TimeoutStartSec and TimeoutStopSec).
Fixes #5430
Closes #12757
(cherry picked from commit bf27fdeaa2)
Related https://github.com/scylladb/scylladb/issues/12658.
This fixes the bug in the upgrade guides for the released versions.
Closes #12679
* github.com:scylladb/scylladb:
doc: fix the service name in the upgrade guide for patch releases versions 2022
doc: fix the service name in the upgrade guide from 2021.1 to 2022.1
(cherry picked from commit 325246ab2a)
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall on a special event like a schema change.
No conflicts when backporting.
Regression since 1d9f53c881, which is present in branch 5.1 onwards.
Closes #12851
* github.com:scylladb/scylladb:
compaction: Fix inefficiency when updating LCS backlog tracker
table: Fix quadratic behavior when inserting sstables into tracker on schema change
The LCS backlog tracker uses the STCS tracker for L0. It turns out the LCS tracker
is calling the STCS tracker's replace_sstables() with empty arguments
even when *only* higher levels (> 0) had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.
As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.
The inefficiency is fixed by updating the STCS tracker only if an
L0 sstable is being added to or removed from the table.
This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantially higher number of sstables;
therefore, updating the STCS tracker only when level 0 changes
significantly reduces the number of times the L0 backlog is recomputed.
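A hedged sketch of the check, with illustrative types rather than the real tracker API:
```
#include <vector>

struct sstable { int level = 0; };      // stand-in for the real sstable type

// Forward a replace to the L0 (STCS) tracker only if the replacement actually
// touches level 0; higher-level-only replacements would otherwise trigger a
// pointless recomputation of the L0 backlog.
bool touches_l0(const std::vector<sstable>& ssts) {
    for (const auto& s : ssts) {
        if (s.level == 0) {
            return true;
        }
    }
    return false;
}

void on_replace(const std::vector<sstable>& removed, const std::vector<sstable>& added) {
    if (touches_l0(removed) || touches_l0(added)) {
        // stcs_tracker.replace_sstables(removed, added);   // hypothetical call into the L0 tracker
    }
}
```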
Refs #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12676
(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Each time the backlog tracker is informed about a new or old sstable, it
will recompute the static part of the backlog, whose complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore producing O(N^2) complexity.
Fixes #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #12593
(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: in Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins runs, the
test directory names are very long, and we sometimes go over this length
limit, with the result that test.py fails to create this socket.
In this patch we simply put the socket in /tmp instead of the test
directory. We only need to make this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).
Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.
Fixes #12622
Closes #12678
(cherry picked from commit 681a066923)
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.
Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager`) would start, because there was free
space in the pool.
This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t. the number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.
Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.
The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).
Fixes #11757
Closes #12549
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
test/pylib: pool: introduce `replace_dirty`
test/pylib: pool: replace `steal` with `put(is_dirty=True)`
(cherry picked from commit 132af20057)
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.
Closes #12612
* github.com:scylladb/scylladb:
test.py: manual cluster handling for PythonSuite
test.py: stop cluster if PythonSuite fails to start
test.py: minor fix for failed PythonSuite test
(cherry picked from commit 5bc7f0732e)
If the after-test check fails (is_after_test_ok is False), discard the cluster and raise an exception so the context manager (pool) does not recycle it.
Ignore the exception re-raised by the context manager.
Fixes #12360
Closes #12569
* github.com:scylladb/scylladb:
test.py: handle broken clusters for Python suite
test.py: Pool discard method
(cherry picked from commit 54f174a1f4)
`ScyllaCluster.server_stop` had this piece of code:
```
server = self.running.pop(server_id)
if gracefully:
await server.stop_gracefully()
else:
await server.stop()
self.stopped[server_id] = server
```
We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).
Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.
Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when cleaning up after the cluster; and we don't leak it.
Closes #12613
(cherry picked from commit a0ff33e777)
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.
Improve logging when the wait for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.
Closes #12588
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: better logging for timeout on server startup
test/pylib: scylla_cluster: use less expensive query to check for CQL availability
(cherry picked from commit ccc2c6b5dd)
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.
Do it by wrapping every handler into a catch clause that returns
the exception message.
Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)
Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```
After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```
Closes #12563
(cherry picked from commit 2f84e820fd)
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.
Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.
Closes #12564
(cherry picked from commit 3ed3966f13)
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
def before_test(self, name) -> None:
"""Check that the cluster is ready for a test. If
there was a start error, throw it here - the server is
running when it's added to the pool, which can't be attributed
to any specific test, throwing it here would stop a specific
test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.
Prevent this by marking the cluster as dirty the first time we rethrow
the exception.
Closes #12560
(cherry picked from commit 147dd73996)
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).
However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.
Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.
Closes #12542
(cherry picked from commit 9029b8dead)
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.
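One way to avoid calling `request_abort()` twice is to check whether an abort was already requested; a self-contained sketch with a simplified stand-in, not Seastar's real `abort_source`:
```
#include <cassert>

// Only one of the two completion paths (timeout vs. shutdown) may request the
// abort; checking abort_requested() first avoids the double-abort assertion.
struct abort_source_like {              // simplified stand-in, not Seastar's class
    bool aborted = false;
    bool abort_requested() const { return aborted; }
    void request_abort() {
        assert(!aborted);               // the real abort_source asserts on a second call
        aborted = true;
    }
};

void maybe_abort(abort_source_like& as) {
    if (!as.abort_requested()) {
        as.request_abort();
    }
}
```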
Fixes: #12512
Closes #12514
(cherry picked from commit 54170749b8)
Before this change, we returned the total memory managed by Seastar
in the "total" field in system.memory, but this value only reflects
the total memory managed by Seastar's allocator. If
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of the
specified size for the system, and takes the remaining memory. Since
f05d612da8, we set this value to 50MB for the wasmtime runtime. Hence
the test `TestRuntimeInfoTable.test_default_content` in dtest
fails. The test expects the size passed via the `--memory` option
to be identical to the value reported by system.memory's
"total" field.
After this change, the "total" field takes the reserved memory
for the wasm UDF into account. The "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12573
(cherry picked from commit 4a0134a097)
Currently reversed types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema which has a reversed<frozen<UDT>> clustering key component
will have this component incorrectly represented in the schema CQL dump:
the UDT will lose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail, as only frozen UDTs are
allowed in primary key components.
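An illustrative sketch of the predicate fix with a simplified type model (not the real abstract_type hierarchy): the check has to look through the reversed wrapper instead of falling into the default case.
```
// Simplified type model, not the real abstract_type hierarchy.
struct type_model {
    bool tuple_like = false;                 // true for tuple/UDT-style types
    const type_model* underlying = nullptr;  // set when this is reversed<underlying>

    bool is_tuple_like() const {
        if (underlying) {
            return underlying->is_tuple_like();  // look through the reversed<> wrapper
        }
        return tuple_like;                       // default case only for non-reversed types
    }
};
```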
Fixes: #12576
Closes #12579
(cherry picked from commit ebc100f74f)
Fixes #12601 (maybe?)
Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
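A minimal sketch of the idea with illustrative types: sorting by ID gives the listing a stable order, so page boundaries never repeat entries.
```
#include <algorithm>
#include <string>
#include <vector>

struct table_ref { std::string id; std::string arn; };    // illustrative

// Sort the tables by ID so that "return everything after the last ID seen"
// yields each table at most once across pages.
void sort_for_paging(std::vector<table_ref>& tables) {
    std::sort(tables.begin(), tables.end(),
              [](const table_ref& a, const table_ref& b) { return a.id < b.id; });
}
```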
(cherry picked from commit da8adb4d26)
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, which wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
Before this change, we constructed an sstring from a comma expression,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.
In this change:
* instead of passing only the size of the string_view,
both its address and size are used
* a `std::string_view` is constructed instead of an sstring, for better
performance, as we don't need to perform a deep copy
The issue is reported by GCC-13:
```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
^~~~~~~~~~
```
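The fixed construction, as described above, casts the pointer only and passes the size separately; a self-contained sketch with a stand-in byte-range type:
```
#include <cstddef>
#include <string_view>

// Stand-in for the byte range used in the original code (begin()/size()).
struct bytes_view_like {
    const std::byte* ptr = nullptr;
    std::size_t len = 0;
    const std::byte* begin() const { return ptr; }
    std::size_t size() const { return len; }
};

// Before: sstring(reinterpret_cast<const char*>(name.begin(), name.size()))
//         -- the comma operator makes the cast apply to name.size() alone.
// After: cast the pointer, pass the size as the second argument, no deep copy.
std::string_view to_view(bytes_view_like name) {
    return std::string_view(reinterpret_cast<const char*>(name.begin()), name.size());
}
```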
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #12666
(cherry picked from commit 186ceea009)
Fixes #12739.
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space is not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
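A hedged sketch of the corrected ordering (simplified types, synchronous removal for illustration): read `known_size()` before the removal resets it.
```
#include <cstdint>
#include <filesystem>

struct segment_file {                       // illustrative stand-in
    std::filesystem::path path;
    uint64_t known = 0;
    uint64_t known_size() const { return known; }
    void remove_file() {
        std::filesystem::remove(path);
        known = 0;                          // this reset is why the old order lost the size
    }
};

void delete_segment(segment_file& f, uint64_t& total_size_on_disk) {
    auto freed = f.known_size();            // capture the size *before* removing the file
    f.remove_file();
    total_size_on_disk -= freed;            // the freed space is now actually accounted for
}
```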
Fixes #12645
Closes #12646
(cherry picked from commit fa7e904cd6)
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.
People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.
Also update the docs: cluster creation and node addition procedures.
Fixes #12572.
Closes #12585
* github.com:scylladb/scylladb:
docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
conf: enable `consistent_cluster_management` by default
(cherry picked from commit 5c886e59de)
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.
Fixes#11723.
Closes #12502
* github.com:scylladb/scylladb:
test: test_topology: test for removing garbage group 0 members
test/pylib: move some utility functions to util.py
db: system_keyspace: add a virtual table with raft configuration
db: system_keyspace: improve system.raft_snapshot_config schema
service: storage_service: better error handling in `decommission`
service: storage_service: fix indentation in removenode
service: storage_service: make `removenode` work for group 0 members which are not token ring members
service/raft: raft_group0: perform read_barrier in wait_for_raft
service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
service/raft: raft_group0: link to Raft docs where appropriate
service/raft: raft_group0: more logging
service/raft: raft_group0: separate function for checking and waiting for Raft
This test is frequently failing due to a timeout when we try to restart
one of the nodes. The shutdown procedure apparently hangs when we try to
stop the `hints_manager` service, e.g.:
```
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Stopped
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO 2023-01-13 03:22:56,997 [shard 0] hints_manager - Stopped
```
Observe the 5-minute delay at the end.
There is a known issue about `hints_manager` stop hanging: #8079.
Now, for some reason, this is the only test case that is hitting this
issue. We don't completely understand why. There is one significant
difference between this test case and others: this is the only test case
which kills 2 (out of 3) servers in the cluster and then tries to
gracefully shutdown the last server. There's a hypothesis that the last
server gets stuck trying to send hints to the killed servers. We weren't
able to prove/falsify it yet. But if it's true, then this patch will:
- unblock next promotions,
- give us some important information when we see that the issue stops
appearing.
In this patch we shut down all servers gracefully instead of killing them,
like we do in the other test cases.
Closes #12548
The docs mention that method, but it doesn't exist. Instead, the
state_machine interface defines a plain .apply().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12541
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).