utils/chunked_vector::reserve_partial: fix usage in callers
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.
Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
This PR updates the method comment and all the callers to use the
right way.
Fixes#19254Closesscylladb/scylladb#19279
* github.com:scylladb/scylladb:
utils/large_bitset: remove unused includes identified by clangd
utils/large_bitset: use thread::maybe_yield()
test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
utils/lsa/chunked_managed_vector: fix reserve_partial()
utils/chunked_vector: return void from reserve_partial and make_room
test/boost/chunked_vector_test: fix testcase tests_reserve_partial
utils/chunked_vector::reserve_partial: fix usage in callers
(cherry picked from commit b2ebc172d0)
Backported from #19308 to 5.4
Closesscylladb/scylladb#19355
This is a fairly elaborate backport of commit b2a500a9a1 to branch 5.4
The code patch itself is trivial, and backported cleanly. The big problem was the test, which was written using the "topology" test framework - because it needs to test a cluster, not a single node, because the scheduling group problem only happened when sending requests between different Scylla nodes.
I had to fix in the backport the following problems:
1. The test used a library function add_servers() which didn't exist in branch 5.4, so needed to switch to making three individual add_server() calls.
2. The test was randomly placed in the topology_experimental_raft directory, which runs with the tablets experimental flag enabled. In 5.4, the tablets code was broken with Alternator, and CreateTable fails (it fails in the callback to create tablets, and doesn't even get to check that tablets weren't requested). So I needed to move it the test file to a different directory.
3. Even after moving the file, it still ran with the tablets experimental feature! Turns out that test.py enabled tablets experimental feature unconditionally. This is a mistake, and I'm sure was never intended (tablets were never meant to be supported in 5.4), so I removed enabling this feature. It's still enabled in the topology_experimental_raft directory, where it is explicitly enabled.
After all that, the test passes with the patch, showing that the code fix is correct also for 5.4.
Closesscylladb/scylladb#19321
* github.com:scylladb/scylladb:
alternator, scheduler: test reproducing RPC scheduling group bug
test.py: don't enable "tablets" experimental feature
main: add maintenance tenant to messaging_service's scheduling config
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 79f6746298)
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.
Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.
The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.
The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1fe8f22d89)
Modifications in the cherry-pick:
* Moved test to topology_custom directory, so it runs without tablets
* use the server_add() function instead of the newer add_servers() which
didn't yet exist in this branch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This branch (5.4) does NOT support tablets, and we don't want to run
any tests with the "tablets" experimental feature. When we made test.py
enable that feature by default, it was probably considered harmless -
the partial implementation we had in this branch won't do anything if
tablets aren't actually enabled for a specific keyspace.
But unfortunately, Alternator doesn't work with tablets enabled (there
was a bug in the callback during table creation), so we can't run any
Alternator tests from test.py (like the one we we wan to backport for
Alternator TTL scheduling groups) unless we drop that experimental
feature.
Note that one specific test subdirectory,
test/topology_experimental_raft, does enable this experimental
flag. The others shouldn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently only the user tenant (statement scheduling group) and system
(default scheduling group) tenants exist, as we used to have only
user-initiated operations and sytem (internal) ones. Now there is need
to distinguish between two kinds of system operation: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).
(cherry picked from commit 5d3f7c13f9)
Fetching only the first page is not the intuitive behavior expected by users.
This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.
Fixes#18774
(cherry picked from commit 43b907b499)
Closesscylladb/scylladb#19129
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes https://github.com/scylladb/scylladb/issues/19034
---
the issue applies to both 5.4 and 6.0, and this issue hurts the CI stability, hence we should backport it.
(cherry picked from commit 2df4e9cfc2)
(cherry picked from commit 223fba3243)
Refs #19252Closesscylladb/scylladb#19256
* github.com:scylladb/scylladb:
test: memtable_test: increase unspooled_dirty_soft_limit
test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to into _throttled_writes.
Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.
The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.
This patch is intended to be a good-enough-in-practice fix for the problem.
Fixes#17476Fixes#1834
(cherry picked from commit 97e1518eb9)
Closesscylladb/scylladb#19179
If we receive a message in the same term but from a different leader
than we expect, we print:
```
Got append request/install snapshot/read_quorum from an unexpected leader
```
For some reason the message did not include the details (who the leader
was and who the sender was) which requires almost zero effort and might
be useful for debugging. So let's include them.
Ref: scylladb/scylla-enterprise#4276
(cherry picked from commit 99a0599e1e)
Closesscylladb/scylladb#19264
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes#19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.
after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.
Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
Currently, when a view update backlog is changed and sent
using gossip, we check whether the strtoll/strtoull
function used for reading the backlog returned
LLONG_MAX/ULLONG_MAX, signaling an error of a value
exceeding the type's limit, and if so, we do not store
it as the new value for the node.
However, the ULLONG_MAX value can also be used as the max
backlog size when sending empty backlogs that were never
updated. In theory, we could avoid sending the default
backlog because each node has its real backlog (based on
the node's memory, different than the ULLONG_MAX used in
the default backlog). In practice, if the node's
backlog changed to 0, the backlog sent by it will be
likely the default backlog, because when selecting
the biggest backlog across node's shards, we use the
operator<=>(), which treats the default backlog as
equal to an empty backlog and we may get the default
backlog during comparison if the backlog of some shard
was never changed (also it's the initial max value
we compare shard's backlogs against).
This patch removes the (U)LLONG_MAX check and replaces
it with the errno check, which is also set to ERANGE during
the strtoll error, and which won't prevent empty backlogs
from being read
Fixes: #18462
(cherry picked from commit 5154429713)
Closesscylladb/scylladb#18697
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closesscylladb/scylladb#19013
PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).
The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.4.
Closesscylladb/scylladb#18660
* github.com:scylladb/scylladb:
sstable_datafile_test: add testcase to test reclaim during reload
sstable_datafile_test: add test to verify auto reload of reclaimed components
sstables_manager: reload previously reclaimed components when memory is available
sstables_manager: start a fiber to reload components
sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
sstable_datafile_test: add test to verify reclaimed components reload
sstables: support reloading reclaimed components
sstables_manager: add new intrusive set to track the reclaimed sstables
sstable: add link and comparator class to support new instrusive set
sstable: renamed intrusive list link type
sstable: track memory reclaimed from components per sstable
sstable: rename local variable in sstable::total_reclaimable_memory_size
When a table has secondary indexes on *multiple* columns, and several
such columns are used for filtering in a query, Scylla chooses one
of these indexes as the main driver of the query, and the second
column's restriction is implemented as filtering.
Before this patch, the index to use was chosen fairly randomly, based on
the order of the indexes in the schema. This order may be different in
different coordinators, and may even change across restarts on the same
coordinators. This is not only inconsistent, it can cause outright wrong
results when using *paging* and switching (or restarting) coordinates
in the middle of a paged scan... One coordinator saves one index's key
in the paging state, and then the other coordinator gets this paging
state and wrongly believes it is supposed to be a key of a *different*
index.
The fix in this patch is to pick the index suitable for the first
indexed column mentioned in the query. This has two benefits over
the situation before the patch:
1. The decision of which index to use no longer changes between
coordinators or across restarts - it just depends on the schema
and the specific query.
2. Different indexes can have different "specificity" so using one
or the other can change the query's performance. After this patch,
the user is in control over which index is used by changing the
order of terms in the query. A curious user can use tracing to
check which index was used to implement a particular query.
An xfailing test we had for this issue no longer fails, so the "xfail"
marker is removed.
Fixes#7969
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 77c61f907e)
Closesscylladb/scylladb#18963
This PR backport the "repair_partition_count_estimation_ratio config option" support to 5.4 branch.
In addition to the main patch "repair: Introduce repair_partition_count_estimation_ratio config option",
the patch "repair: Add missing db/config.hh" is added too.
Closesscylladb/scylladb#18881
* github.com:scylladb/scylladb:
repair: Introduce repair_partition_count_estimation_ratio config option
repair: Add missing db/config.hh
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard coded
estimation ratio is used.
This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.
It is set to 0.1 by default.
Fixes#18615
(cherry picked from commit 340eae007a)
Since commit 952dfc6157 "repair: Introduce
repair_partition_count_estimation_ratio config option", get_config() is
used. We need to include db/config.hh for that.
Spotted when backporting to 5.4 branch.
Refs #18615Closesscylladb/scylladb#18780
(cherry picked from commit 1a03e3d5ae)
On 7ce6962141 we dropped openssh-server,
it also dropped systemd package and caused an error on Scylla Operator
(#17787).
This reverts dropping systemd package and fix the issue.
Fix#17787
(cherry picked from commit 0c7aa9342d)
Closesscylladb/scylladb#18834
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.
As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.
This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.
While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.
The bug, the fix and the test were originally discovered by
Michał Chojnowski.
Fixes#18688
Refs #18690
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 27ab560abd)
Closesscylladb/scylladb#18717
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixes scylladb/scylladb#18635
- [x] Although the fixes is for a rare bug, it has very low risk and so it's worth backporting to all live versions
(cherry picked from commit 64c51cf32c)
(cherry picked from commit 88b3173d03)
(cherry picked from commit 4bbb66f805)
Refs #18636Closesscylladb/scylladb#18679
* github.com:scylladb/scylladb:
chunked_vector_test: add more exception safety tests
chunked_vector_test: exception_safe_class: count also moved objects
utils: chunked_vector: fill ctor: make exception safe
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.
However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixesscylladb/scylladb#18635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
this change is inspired by following warning from clang-tidy
```
Warning: /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:884:13: warning: 'tr_state' used after it was moved [bugprone-use-after-move]
884 | if (tr_state) {
| ^
/home/runner/work/scylladb/scylladb/service/storage_proxy.cc:872:139: note: move occurred here
872 | auto f = get_schema_for_read(proposal.update.schema_version(), src_addr, *timeout).then([&sp = _sp, &sys_ks = _sys_ks, tr_state = std::move(tr_state),
| ^
```
this is not a false positive. as `tr_state` is a captured by move for
constructing a variable in the captured list of a lambda which is in
turn passed to the expression evaluated to `f`.
even the expression itself is not evaluated yet when we reference
`tr_state` to check if it is empty after preparing the expression,
`tr_state` is already moved away into the captured variable. so
at that moment, the statement of `f = f.finally(...)` is never
evaluated, because `tr_state` is always empty by then.
so before this change, the trace message is never recorded.
in this change, we address this issue by capturing `tr_state` by
copying it. as `tr_state` is backed by a `lw_shared_ptr`, the overhead is
neglectable.
after this change, the tracing message is recorded.
the change introduced this issue was 548767f91e.
please note, we could coroutinize this function to improve its
readability, but since this is a fix and should be backported,
let's start with a minimal fix, and worry about the readability
in a follow-up change.
Refs 548767f91eFixes#18725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit a429e7b1fe)
Closesscylladb/scylladb#18763
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.
Fixes: #18662
(cherry picked from commit f6511ca1b0)
(cherry picked from commit e7b07692b6)
(cherry picked from commit 78afb3644c)
Refs #18667Closesscylladb/scylladb#18724
* github.com:scylladb/scylladb:
test/boost/mutation_fragment_test.cc: add test for validator validation levels
mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.
(cherry picked from commit e7b07692b6)
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.
(cherry picked from commit f6511ca1b0)
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.
in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.
Fixes#16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes https://github.com/scylladb/scylladb/pull/16345
(cherry picked from commit 273ee36bee)
Fixes#18381
- [x] need to backport, because we have needs in production to figure out the mapping from an sstable identifier to the shard which "owns" it.
Closesscylladb/scylladb#18681
* github.com:scylladb/scylladb:
tools: Make sstable shard-of efficient by loading minimum to compute owners
test/cql-pytest/test_tools.py: test shard-of with a single partition
tools/scylla-sstable: add `scylla sstable shard-of` command
Getting token() function first tries to find a schema for underlying
table and continues with nullptr if there's no one. Later, when creating
token_fct, the schema is passed as is and referenced. If it's null crash
happens.
It used to throw before 5983e9e7b2 (cql3: test_assignment: pass optional
schema everywhere) on missing schema, but this commit changed the way
schema is looked up, so nullptr is now possible.
fixes: #18637
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit df8a446437)
Closesscylladb/scylladb#18698
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.
in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.
Fixes#16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16345
(cherry picked from commit 273ee36bee)
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.
Fixes#18283
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18465
(cherry picked from commit d39adf6438)
Closesscylladb/scylladb#18656
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with large number of cores.
In such case iotune reports failure due to
unability to create files or to set up seastar
framework.
This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.
The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
(cherry picked from commit ec820e214c)
Closesscylladb/scylladb#18655
Mostly a set of fixes in the area of ssl handling
* tools/cqlsh 99b2b777...9d49b385 (21):
> cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
> cqlshlib/sslhandling.py: don't use empty userkey/usercert
> Dockerfile: noninteractive isn't enough for answering yet on apt-get
> fix cqlsh version print
> cqlshlib/sslhandling: change `check_hostname` deafult to False
> Introduce new ssl configuration for disableing check_hostname
> set the hostname in ssl_options.server_hostname when SSL is used
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73 Fixed a bug where username and password from the credentials file were ignored.
> issue-73
> github actions: update `cibuildwheel==v2.16.5`
> dist/debian: fix the trailer line format
> `COPY TO STDOUT` shouldn't put None where a function is expected
> Make cqlsh work with unix domain sockets
> Bump python-driver version
> dist/debian: add trailer line
> dist/debian: wrap long line
> Draft: explicit build-time packge dependencies
> stop retruning status_code=2 on schema disagreement
> Fix minor typos in the code
> Dockerfile: apt-get update and apt-get upgrade to get latest OS packages
Ref: #18590Closesscylladb/scylladb#18652
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0b061194a7)
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f758d7b114)
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.
The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 54bb03cff8)
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2340ab63c6)
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3ef2f79d14)
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 02d272fdb3)
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a53af1f878)
This PR removes the incorrect information that the ScyllaDB Rust Driver is not GA.
In addition, it replaces "Scylla" with "ScyllaDB".
Fixes https://github.com/scylladb/scylladb/issues/16178Closesscylladb/scylladb#16199
* github.com:scylladb/scylladb:
doc: remove the "preview" label from Rust driver
doc: fix Rust Driver release information
(cherry picked from commit 56c3515751)
Currently default task_ttl_in_seconds is 0, but scylla.yaml changes
the value to 10.
Change task_ttl_in_seconds in scylla.yaml to 0, so that there are
consistent defaults. Comment it out.
Fixes: #16714.
(cherry picked from commit 67bbaad62e)
Closesscylladb/scylladb#18584
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
(cherry picked from commit 8df6d10e88)
Closesscylladb/scylladb#18559
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:
* blob
* date
* inet
* time
* timestamp
* timeuuid
* uuid
Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happwn to work in JSON, this didn't fail.
The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.
This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even befoe this patch) work with this patch.
Fixes#18477.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 21557cfaa6)
Closesscylladb/scylladb#18522
The current text seems to suggest that `USING TIMEOUT` doesn't work with `DELETE` and `BATCH`. But that's wrong.
Closesscylladb/scylladb#18424
(cherry picked from commit c1146314a1)
The event is used in a loop.
Found by clang-tidy:
```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:39: note: move occurred here
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
listener->handle_stream_event(std::move(event));
^
```
Fixes#18332Closesscylladb/scylladb#18333
(cherry picked from commit 1ca779d287)
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.
Fixes#18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18400
(cherry picked from commit 6af2659b57)
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
Fixes#17890
(cherry picked from commit 9789a3dc7c)
Refs #17891Closesscylladb/scylladb#18108
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
remove_file(file_path).get();
^
sstables/storage.cc:145:64: note: move occurred here
auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));
```
It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.
courtesy of clang-tidy.
Fixes#18323.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2fba1f936d)
Closesscylladb/scylladb#18382
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.
in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.
this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.
Refs 822a315dfaFixes#18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)
Closesscylladb/scylladb#18374
The new MX-native validator, which validates the index in tandem with the data file, was discovered to print false-positive errors, related to range-tombstones and promoted-index positions.
This series fixes that. But first, it refactors the scrub-related tests. These are currently dominated by boiler-plate code. They are hard to read and hard to write. In the first half of the series, a new scrub_test is introduced, which moves all the boiler-plate to a central place, allowing the tests to focus on just the aspect of scrub that is tested.
Then, all the found bugs in validate are fixed and finally a new test, checking validate with valid sstable is introduced.
This PR backports https://github.com/scylladb/scylladb/pull/16327.
Fixes: https://github.com/scylladb/scylladb/issues/16326Closesscylladb/scylladb#18404
* github.com:scylladb/scylladb:
test/boost/sstable_compaction_test: add validation test with valid sstable
sstablex/mx/reader: validate(): print trace message when finishing the PI block
sstablex/mx/reader: validate(): make index-data PI position check message consistent
sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
sstablex/mx/reader: validate(): reset the current PI block on partition-start
sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
test/boost/sstable_compaction_test: fix indentation
test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
test/boost/sstable_compaction_test: introduce scrub_test_framework
test/lib/random_schema: add uncompatible_timestamp_generator()
Add a positive test, as it turns out we had some false-positive
validation bugs in the validator and we need a regression test for this.
(cherry picked from commit 2335f42b2b)
The message says "index-data" but when printing the position, the data
position is printed first, causing confusion. Fix this and while at it,
also print the position of the partition start.
(cherry picked from commit 677be168c4)
The validate() consumes the content of partitions in a consume-loop.
Every time the consumer asks for a "break", the next PI block is loaded
and set on the validator, so it can validate that further clustering
elements are indeed from this block.
This loop assumed the consumer would only request interruption when the
current clustering block is finished. This is wrong, the consumer can
also request interruption when yielding is needed. When this is the
case, the next PI block doesn't have to be loaded yet, the current one
is not exhausted yet. Check this condition, before loading the next PI
block, to prevent false positive errors, due to mismatched PI block
and clustering elements from the sstable.
(cherry picked from commit 5bff7c40d3)
It is possible that the next partition has no PI and thus there won't be
a new PI block to overwrite the old one. This will result in
false-positive messages about rows being outside of the finished PI
block.
(cherry picked from commit e073df1dbb)
Promoted index entries can be written on any clustering elements,
icluding range tombstones. So the validating consumer also has the check
whether the current expected clustering block is finished, when
consuming a range tombstone. If it is, consumption has to be
interrupted, so that the outer-loop can load up the next promoted index
block, before moving on to the next clustering element.
(cherry picked from commit 2737899c21)
For range tombstone end-bounds, the validate_fragment_order() should be
passed a null tombstone, not a disengaged optional. The latter means no
change in the current tombstone. This caused the end bound of range
tombstones to not make it to the validator and the latter complained
later on partition-end that the partition has unclosed range tombstone.
(cherry picked from commit f46b458f0d)
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.
(cherry picked from commit c35092aff6)
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.
(cherry picked from commit 3f76aad609)
The test becomes a lot shorter and it now uses random schema and random
data. The test is also split in two: one test for abort mode and one for
skip mode.
Indentation is left broken, to be fixed in a future patch.
(cherry picked from commit 5237e8133b)
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.
(cherry picked from commit 76785baf43)
Scrub tests require a lot of boilerplate code to work. This has a lot of
disadvantages:
* Tests are long
* The "meat" of the test is lost between all the boiler-plate, it is
hard to glean what a test actually does
* Tests are hard to write, so we have only a few of them and they test
multiple things.
* The boiler-plate differs sligthly from test-to-test.
To solve this, this patch introduces a new class, `scrub_test_frawmework`,
which is a central place for all the boiler-plate code needed to write
scrub-related tests. In the next patches, we will migrate scrub related
tests to this class.
(cherry picked from commit b6f0c4efa0)
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.
Fixes#17854
(cherry picked from commit efcb718e0a)
Closesscylladb/scylladb#18107
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.
Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.
The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.
This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.
Fixes#18140Closesscylladb/scylladb#18141
(cherry picked from commit 642f9a1966)
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the 1 phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
then the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
The PR also adds a regression test.
Fixes: scylladb/scylladb#15393Fixes: scylladb/scylladb#15602Fixes: scylladb/scylladb#16668Fixes: scylladb/scylladb#16902Fixes: scylladb/scylladb#17493Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b310)
(cherry picked from commit 72955093eb)
Refs scylladb/scylladb#18184Closesscylladb/scylladb#18245
* github.com:scylladb/scylladb:
test: reproducer for missing gossiper updates
gossiper: lock local endpoint when updating heart_beat
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.
The feature can be manually verified by doing the following :
1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB
The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.
Fixes https://github.com/scylladb/scylladb/issues/17747
Backported from #17771 to 5.4.
Closesscylladb/scylladb#18248
* github.com:scylladb/scylladb:
test_bloom_filter.py: disable reclaiming memory from components
sstable_datafile_test: add tests to verify auto reclamation of components
test/lib: allow overriding available memory via test_env_config
sstables_manager: support reclaiming memory from components
sstables_manager: store available memory size
sstables_manager: add variable to track component memory usage
db/config: add a new variable to limit memory used by table components
sstable_datafile_test: add testcase to verify reclamation from sstables
sstables: support reclaiming memory from components
Regression test for scylladb/scylladb#17493.
(cherry picked from commit 72955093eb)
Backport note: removed `timeout` parameter passed to `server_add`,
missing on this branch. (If server adding hangs, it will timeout after
`TOPOLOGY_TIMEOUT` from scylla_cluster.py)
Removed `force_gossip_join_boot` error injection from test, not present
in this branch. Starting nodes with `experimental_features` disabled.
Added missing `handle_state_normal.*finished` message.
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename
This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename
Closesscylladb/scylladb#18172
(cherry picked from commit 223275b4d1)
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the 1 phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
then the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
Fixes: scylladb/scylladb#15393Fixes: scylladb/scylladb#15602Fixes: scylladb/scylladb#16668Fixes: scylladb/scylladb#16902Fixes: scylladb/scylladb#17493Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b310)
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.
Fixes#17747
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.
This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.
fixes: #16001
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18132
(cherry picked from commit 3471f30b58)
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.
Modify what's counted to repair memory limit.
Fixes: #16710.
(cherry picked from commit a4dc6553ab)
(cherry picked from commit 51c09a84cc)
Refs #17785Closesscylladb/scylladb#18205
* github.com:scylladb/scylladb:
test: add test for repair_row::size()
repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.
Include other repair_row's members' memory size in repair memory
limit.
(cherry picked from commit a4dc6553ab)
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
(cherry picked from commit f1cc6252fd)
before this change, `reclaim_timer::report()` calls
```c++
fmt::format(", at {}", current_backtrace())
```
which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.
in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.
Fixes#18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18100
(cherry picked from commit fcf7ca5675)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
(cherry picked from commit 5223d32fab)
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.
After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes
This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.
The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.
The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.
Fixes#17364
(cherry picked from commit 4c767c379c)
Closesscylladb/scylladb#18105
according to {fmt}'s document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
```
// the range will contain "f} continued". The formatter should parse
// specifiers until '}' or the end of the range. In this example the
// formatter should parse the 'f' specifier and return an iterator
// pointing to '}'.
```
so we should check for _both_ '}' and end of the range. when building
scylla with {fmt} 10.2.1, it fails to build code like
```c++
fmt::format_to(out, "{}", fmt_hex(frag))
```
as {fmt}'s compile-time checker fails to parse this format string
along with given argument, as at compile time,
```c++
throw format_error("invalid group_size")
```
is executed.
so, in this change, we check both '}' and the end of range.
the change which introduced this formatter was
2f9dfba800
Refs 2f9dfba800
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ef2afa75d1)
Closesscylladb/scylladb#18086
Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not
pointing at a row. In particular, the iterator returned by such call might be
dangling.
We have observed this to cause a use-after-free in the field, when a reverse
read called `maybe_add_to_cache` after `_latest_it` was left dangling after
a dead row removal in `copy_from_cache_to_buffer`.
To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest`
is pointing at a row.
Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous.
As far as I can see, other occurrences can't break anything as of now.
But we apply fixes to them anyway.
(cherry picked from commit 04db6d4bb1)
Closesscylladb/scylladb#18075
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.
Fixes#15902
(cherry picked from commit 8a5689e7a7)
(cherry picked from commit 1d33a68dd7)
Closesscylladb/scylladb#17909
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:
node1 node2
repair1_master (takes permit) repair1_follower (waits on permit)
repair2_master (waits for permit) repair2_follower (takes permit)
In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.
Refs: #17644Fixes: #17591Closesscylladb/scylladb#17646
(cherry picked from commit 65b9e10543)
Closesscylladb/scylladb#17950
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)
test/boost/error_injection_test.cc(167): fatal error:
in "test_inject_future_disabled":
critical check wait_time > sleep_msec has failed [23839ns <= 10ms]
Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
Fixes#15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
This commit updates the Upgrade ScyllaDB Image page.
- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.
(cherry picked from commit fc90112b97)
Closesscylladb/scylladb#17886
Since 6b87778 regular compaction tasks are removed from task manager
immediately after they are finished.
test_regular_compaction_task lists compaction tasks and then requests
their statuses. Only one regular compaction task is guaranteed to still
be running at that time, the rest of them may finish before their status
is requested and so it will no longer be in task manager, causing the test
to fail.
Fix statuses check to consider the possibility of a regular compaction
task being removed from task manager.
Fixes: #17776.
(cherry picked from commit 80c5eb4ecb)
Closesscylladb/scylladb#17810
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.4.
We haven't identified any specific issues in test or field in 5.4/2024.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.
For 5.4 it's even more important than 5.2 because in 5.4 in Raft mode schema pulls are disabled.
Closesscylladb/scylladb#17641
* github.com:scylladb/scylladb:
raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
migration_manager: fix indentation after the previous patch.
messaging_service: process migration_request rpc on shard 0
migration_manager: take group0 lock during raft snapshot taking
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.
We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!
We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.
The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
Fixes#17119
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17172
(cherry picked from commit 21e7deafeb)
Since commit f1bbf70, many compaction types can do cleanup work, but turns out
we forgot to invalidate cache on their completion.
So if a node regains ownership of token that had partition deleted in its previous
owner (and tombstone is already gone), data can be resurrected.
Tablet is not affected, as it explicitly invalidates cache during migration
cleanup stage.
Scylla 5.4 is affected.
Fixes#17501.
Fixes#17452.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17502
(cherry picked from commit f07c233ad5)
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```
Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.
Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.
Fixesscylladb/scylladb#16373
(cherry picked from commit a68701ed4f)
Closesscylladb/scylladb#17702
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.
This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.
Fixes#17625
(cherry picked from commit 7a7b8972e5)
Closesscylladb/scylladb#17717
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
(cherry picked from commit 76f0d5e35b)
Closesscylladb/scylladb#17677
Commit 0c376043eb added access to group0
semaphore which can be done on shard0 only. Unlike all other group0 rpcs
(that already always forwarded to shard0) migration_request does not
since it is an rpc that what reused from non raft days. The patch adds
the missing jump to shard0 before executing the rpc.
(cherry picked from commit 4a3c79625f)
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.
Fixesscylladb/scylladb#16821
(cherry picked from commit 0c376043eb)
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.
The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.
The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.
There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:
1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.
Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:
1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.
The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.
Refs: #17493Closesscylladb/scylladb#17499
* github.com:scylladb/scylladb:
version_generator: check that get_next_version is called on shard 0
misc_services: fix data race from bad usage of get_next_version
(cherry picked from commit fd32e2ee10)
This test needed a lot of data to ensure multiple pages when doing the read repair. This change two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a6fa, reducing this to a too small value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.
The latter configuration is a new one, added by the first patches of this series. It allows configuring the page-size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any tests which wanted to check paging, with pages cut on size, to work with large datasets. This was especially pronounced in the tests fixed in this PR, because this test works with tombstones which are tiny and a lot of them were needed to trigger paging based on the size.
With this two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8
The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).
Fixes: https://github.com/scylladb/scylladb/issues/15425
Fixes: https://github.com/scylladb/scylladb/issues/16899Closesscylladb/scylladb#17529
* github.com:scylladb/scylladb:
test/topology_custom: test_read_repair.py: reduce run-time
replica/database: get_query_max_result_size(): use query_page_size_in_bytes
replica/database: use include page-size in max-result-size
query-request: max_result_size: add without_page_limit()
db/config: introduce query_page_size_in_bytes
(cherry picked from commit 616eec2214)
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.
To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.
Use the method in streaming and repair to continue them when
the table is dropped in the meantime.
Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.
Closesscylladb/scylladb#17525
* github.com:scylladb/scylladb:
repair: handle no_such_column_family from remote node gracefully
test: test drop table on receiver side during streaming
streaming: fix indentation
streaming: handle no_such_column_family from remote node gracefully
repair: add methods to skip dropped table
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
Fixes: https://github.com/scylladb/scylladb/issues/17232.
Closesscylladb/scylladb#17531
* github.com:scylladb/scylladb:
test: add test to check if reader is closed
sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
(cherry picked from commit 5227336a32)
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.
(cherry picked from commit cf36015591)
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.
(cherry picked from commit 219e1eda09)
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.
Add method which waits until all raft changes are applied and then
checks whether given table exists.
Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.
(cherry picked from commit 5202bb9d3c)
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incrmented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes#17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17164
(cherry picked from commit 14315fcbc3)
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.
This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.
The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.
Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.
Apart from this change, this PR adds tests for it and updates
the documentation.
This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.
Fixesscylladb/scylladb#7251Fixesscylladb/scylladb#15260Closesscylladb/scylladb#17134
* github.com:scylladb/scylladb:
docs: using-scylla: cdc: remove info about failing writes to old generations
docs: dev: cdc: document writing to previous CDC generations
test: add test_writes_to_previous_cdc_generations
cdc: generation: allow increasing generation_leeway through error injection
cdc: metadata: allow sending writes to the previous generations
(cherry picked from commit 9bb4482ad0)
Backport note: in tests, replaced `servers_add` with loop of `server_add`
The currently used versions of "time" and "rustix" depencies
had minor security vulnerabilities.
In this patch:
- the "rustix" crate is updated
- the "chrono" crate that we depend on was not compatible
with the version of the "time" crate that had fixes, so
we updated the "chrono" crate, which actually removed the
dependency on "time" completely.
Both updated were performed using "cargo update" on the
relevant package and the corresponding version.
Refs #15772Closesscylladb/scylladb#17407
With this commit:
- The information about ScyllaDB Enterprise OS support
is removed from the Open Source documentation.
- The information about ScyllaDB Open Source OS support
is moved to the os-support-info file in the _common folder.
- The os-support-info file is included in the os-support page
using the scylladb_include_flag directive.
This update employs the solution we added with
https://github.com/scylladb/scylladb/pull/16753.
It allows to dynamically add content to a page
depending on the opensource/enterprise flag.
Refs https://github.com/scylladb/scylladb/issues/15484Closesscylladb/scylladb#17310
(cherry picked from commit ef1468d5ec)
During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.
Fixes#15721
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#17218
(cherry picked from commit 3b7b315f6a)
This commit adds the missing redirections
to the pages whose source files were
previously stored in the install-scylla folder
and were moved to another location.
Closesscylladb/scylladb#17367
(cherry picked from commit e132ffdb60)
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.
So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.
Fixes: #17241Closesscylladb/scylladb#17242
(cherry picked from commit f068d1a6fa)
* seastar 95a38bb0...9d44e5eb (1):
> Merge "Slowdown IO scheduler based on dispatched/completed ratio" int branch-5.4
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.
The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.
Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.
Fixes#17035Closesscylladb/scylladb#17073
* github.com:scylladb/scylladb:
utils: replace assert() by on_internal_error()
utils: add on_internal_error with common logger
utils: add a timeuuid minimum, like we had maximum
test/cql-pytest: tests for "date" type
(cherry picked from commit 2a4b991772)
Backports required to fix https://github.com/scylladb/scylladb/issues/16683 in 5.4:
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot (we may get such deployments by upgrading from 5.2)
Closesscylladb/scylladb#17123
* github.com:scylladb/scylladb:
test_raft_snapshot_request: fix flakiness (again)
test_raft_snapshot_request: fix flakiness
Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This PR:
- Adds the upgrade guide from ScyllaDB Open Source 5.4 to ScyllaDB Enterprise 2024.1. Note: The need to include the "Restore system tables" step in rollback has been confirmed; see https://github.com/scylladb/scylladb/issues/11907#issuecomment-1842657959.
- Removes the 5.1-to-2022.2 upgrade guide (unsupported versions).
Fixes https://github.com/scylladb/scylladb/issues/16445Closesscylladb/scylladb#16887
* github.com:scylladb/scylladb:
doc: fix the OSS version number
doc: metric updates between 2024.1. and 5.4
doc: remove the 5.1-to-2022.2 upgrade guide
doc: add the 5.4-to-2024.1 upgrade guide
(cherry picked from commit edb983d165)
This commit adds the information that
ScyllaDB Enterprise 2024.1 is based
on ScyllaDB Open Source 5.4
to the OSS vs. Enterprise matrix.
Closesscylladb/scylladb#16880
(cherry picked from commit a462b914cb)
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().
Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.
The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.
As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.
If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.
This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).
Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.
Fixes#16759Closesscylladb/scylladb#17138
(cherry picked from commit ed98102c45)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.
To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.
Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.
But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.
Closesscylladb/scylladb#17106
(cherry picked from commit c911bf1a33)
Add workaround for scylladb/python-driver#295.
Also an assert made at the end of the test was false, it is fixed with
appropriate comment added.
(cherry picked from commit 74bf60a8ca)
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.
Also add a test.
Fixesscylladb/scylladb#16683Closesscylladb/scylladb#17072
* github.com:scylladb/scylladb:
test: add test for fixing a broken group 0 snapshot
raft_group0: trigger snapshot if existing snapshot index is 0
(cherry picked from commit 181f68f248)
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
The PR adds the API to `raft::server` and a HTTP endpoint that uses it.
In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)
Closesscylladb/scylladb#16816
* github.com:scylladb/scylladb:
raft: remove `empty()` from `fsm_output`
test: add test for manual triggering of Raft snapshots
api: add HTTP endpoint to trigger Raft snapshots
raft: server: add `trigger_snapshot` API
raft: server: track last persisted snapshot descriptor index
raft: server: framework for handling server requests
raft: server: inline `poll_fsm_output`
raft: server: fix indentation
raft: server: move `io_fiber`'s processing of `batch` to a separate function
raft: move `poll_output()` from `fsm` to `server`
raft: move `_sm_events` from `fsm` to `server`
raft: fsm: remove constructor used only in tests
raft: fsm: move trace message from `poll_output` to `has_output`
raft: fsm: extract `has_output()`
raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
raft: server: pass `*_aborted` to `set_exception` call
(cherry picked from commit d202d32f81)
Backport note: the HTTP API is only started if raft_group_registry is
started.
When a base table changes and altered, so does the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns).
However the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.
Fixes https://github.com/scylladb/scylladb/issues/16392
This series also adds a test which fails without this fix and passes when the fix is applied.
Closesscylladb/scylladb#16897
* github.com:scylladb/scylladb:
Add test for mv prepared statements invalidation on base alter
query processor: treat view changes at least as table changes
(cherry picked from commit 5810396ba1)
This commit removes the upgrade guides
from ScyllaDB Open Source to Enterprise
for versions we no longer support.
In addition, it removes a link to
one of the removed pages from
the Troubleshooting section (the link is
redundant).
(cherry picked from commit 0ad3ef4c55)
Closesscylladb/scylladb#16913
This commit removes the information about ScyllaDB Cloud Serverless,
which is no longer valid.
(cherry picked from commit 758284318a)
Closesscylladb/scylladb#16805
When the reader is currently paused, it is resumed, fast-forwarded, then
paused again. The fast forwarding part can throw and this will lead to
destroying the reader without it being closed first.
Add a try-catch surrounding this part in the code. Also mark
`maybe_pause()` and `do_pause()` as noexcept, to make it clear why
that part doesn't need to be in the try-catch.
Fixes: #16606Closesscylladb/scylladb#16630
(cherry picked from commit 204d3284fa)
Regular compaction tasks are internal.
Adjust test_compaction_task accordingly: modify test_regular_compaction_task,
delete test_running_compaction_task_abort (relying on regular compaction)
which checks are already achived by test_not_created_compaction_task_abort.
Rename the latter.
(cherry picked from commit 6b87778ef2)
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixesscylladb/scylladb#15777Closesscylladb/scylladb#15820
to address the confliction, following change is also included in this changeset:
tools/scylla-nodetool: implement the cleanup command
The --jobs command-line argument is accepted but ignored, just like the
current nodetool does.
Refs: scylladb/scylladb#15588Closesscylladb/scylladb#16756
* github.com:scylladb/scylladb:
docs: nodetool: flush: enrich examples
docs: nodetool: compact: fix example
api: add /storage_service/compact
api: add /storage_service/flush
tools/scylla-nodetool: implement the flush command
compaction_manager: flush_all_tables before major compaction
database: add flush_all_tables
api: compaction: add flush_memtables option
test/nodetool: jmx: fix path to scripts/scylla-jmx
scylla-nodetool, docs: improve optional params documentation
tools/scylla-nodetool: extract keyspace/table parsing
tools/scylla-nodetool: implement the cleanup command
test/nodetool: rest_api_mock: add more options for multiple requests
This commit removes support for CentOS 7
from the docs.
The change applies to version 5.4,so it
must be backported to branch-5.4.
Refs https://github.com/scylladb/scylla-enterprise/issues/3502
In addition, this commit removes the information
about Amazon Linux and Oracle Linux, unnecessarily added
without request, and there's no clarity over which versions
should be documented.
Closesscylladb/scylladb#16279
(cherry picked from commit af1405e517)
This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.
Fixes#15483Closesscylladb/scylladb#15945
* github.com:scylladb/scylladb:
partition_version: fix violation of "older versions are evicted first" during schema upgrades
cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()
cache_flat_mutation_reader: fix a continuity loss in maybe_update_continuity()
cache_flat_mutation_reader: fix continuity losses during cache population races with reverse reads
partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads
cache_flat_mutation_reader: fix some cache mispopulations with reverse reads
cache_flat_mutation_reader: fix a logic bug in ensure_population_lower_bound() with reverse reads
cache_flat_mutation_reader: never make an unlinked last dummy continuous
(cherry picked from commit 6bcf3ac86c)
If std::vector is resized its iterators and references may
get invalidated. While task_manager::task::impl::_children's
iterators are avoided throughout the code, references to its
elements are being used.
Since children vector does not need random access to its
elements, change its type to std::list<foreign_task_ptr>, which
iterators and references aren't invalidated on element insertion.
Fixes: #16380.
Closesscylladb/scylladb#16381
(cherry picked from commit 9b9ea1193c)
Provide 3 examples, like in the nodetool/compact page:
global, per-keyspace, per-table.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 310ff20e1e)
It looks like `nodetool compact standard1` is meant
to show how to compact a specified table, not a keyspace.
Note that the previous example like is for a keyspace.
So fix the table compaction example to:
`nodetool compact keyspace1 standard1`
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d32b90155a)
For major compacting all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b12b142232)
For flushing all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1b576f358b)
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixesscylladb/scylladb#15777
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 66ba983fe0)
Flushes all tables after forcing force_new_active_segment
of the commitlog to make sure all commitlog segments can
get recycled.
Otherwise, due to "false sharing", rarely-written tables
might inhibit recycling of the commitlog segments they reference.
After f42eb4d1ce,
that won't allow compaction to purge some tombstones based on
the min_gc_time.
To be used in the next patch by major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit be763bea34)
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip flushing
of tables right before they are major-compacted.
This is useful to prevent creation of small sstables
due to excessive memtable flushing.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1fd85bd37b)
The current implementation makes no sense.
Like `nodetool_path`, base the default `jmx_path`
on the assumption that the test is run using, e.g.
```
(cd test/nodetool; pytest --nodetool=cassandra test_compact.py)
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7f860d612a)
Document the behavior if no keyspace is specified
or no table(s) are specified for a given keyspace.
Fixesscylladb/scylladb#16032
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9324363e55)
Having to extract 1 keyspace and N tables from the command-line is
proving to be a common pattern among commands. Extract this into a
method, so the boiler-plate can be shared. Add a forward-looking
overload as well, which will be used in the next patch.
(cherry picked from commit f082cc8273)
Change the current bool multiple param to a weak enum, allowing for a
third value: ANY, which allows for 0 matches too.
(cherry picked from commit 7e3a78d73d)
TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows.
Turns out we had two problems in this area that leads to suboptimal bloom filters.
1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed.
2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count.
For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong.
Fixes https://github.com/scylladb/scylladb/issues/15704.
Closesscylladb/scylladb#15938
* github.com:scylladb/scylladb:
streaming: Improve partition estimation with TWCS
streaming: Don't adjust partition estimate if segregation is postponed
(cherry picked from commit 64d1d5cf62)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#16671
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
(cherry picked from commit 6ffb482bf3)
The polling loop was intended to ignore
`condition_variable_timed_out` and check for progress
using a longer `max_idle_duration` timeout in the loop.
Fixes#15669
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#15671
(cherry picked from commit 68a7bbe582)
Currently, `tool_app_template::run_async()` crashes when invoked with empty argv (with just `argv[0]` populated). This can happen if the tool app is invoked without any further args, e.g. just invoking `scylla nodetool`. The crash happens because unconditional dereferencing of `argv[1]` to get the current operation.
To fix, add an early-exit for this case, just printing a usage message and exiting with exit code 2.
Fixes: #16451Closesscylladb/scylladb#16456
* github.com:scylladb/scylladb:
test: add regression tests for invoking tools with no args
tools/utils: tool_app_template: handle the case of no args
tools/utils: tool_app_template: remove "scylla-" prefix from app name
(cherry picked from commit 5866d265c3)
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
a mutation from the reader, uses consume(), which does not trigger
`set_close_is_required()` (#16520). Due to this, the top-level
combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
for the current range. If the circumstances are just right, all
underlying readers are closed, before the combined reader is
destoyed. Looks like this is what happens for the most time.
This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.
Fixes: #16519Closesscylladb/scylladb#16521
(cherry picked from commit da033343b7)
Commit 62458b8e4f introduced the enforcement of EXECUTE permissions of functions in cql select. However, according to the reference in #12869, the permissions should be enforced only on UDFs and UDAs.
The code does not distinguish between the two so the permissions are also unintenionally enforced also on native function. This commit introduce the distinction and only enforces the permissions on non native functions.
Fixes#16526
Manually verified (before and after change) with the reproducer supplied in #16526 and also with some the `min` and `max` native functions.
Also added test that checks for regression on native functions execution and verified that it fails on authorization before
the fix and passes after the fix.
Closesscylladb/scylladb#16556
* github.com:scylladb/scylladb:
test.py: Add test for native functions permissions
select statement: verify EXECUTE permissions only for non native functions
(cherry picked from commit fc71c34597)
This short series fixes a regression from Scylla 5.2 to Scylla 5.4 in "SELECT * GROUP BY" - this query was supposed to return just a single row from each partition (the first one in clustering order), but after the expression rewrite started to wrongly return all rows.
The series also includes a regression test that verifies that this query works doesn't work correctly before this series, but works with this patch - and also works as expected in Scylla 5.2 and in Cassadra.
Fixes#16531.
Closesscylladb/scylladb#16559
* github.com:scylladb/scylladb:
test/cql-pytest: check that most aggregators don't take "*"
cql-pytest: add reproducer for GROUP BY regression
cql: fix regression in SELECT * GROUP BY
(cherry picked from commit 3968fc11bf)
systemd man page says:
systemd-fstab-generator(3) automatically adds dependencies of type Before= to
all mount units that refer to local mount points for this target unit.
So "Before=local-fs.taget" is the correct dependency for local mount
points, but we currently specify "After=local-fs.target", it should be
fixed.
Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target",
since .mount are not related with multi-user but depends local
filesystems.
Fixes#8761Closesscylladb/scylladb#15647
(cherry picked from commit a23278308f)
Having values of the duration type is not allowed for clustering
columns, because duration can't be ordered. This is correctly validated
when creating a table but do not validated when we alter the type.
Fixes#12913Closesscylladb/scylladb#16022
(cherry picked from commit bd73536b33)
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes#15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15746
(cherry picked from commit fded314e46)
The use statement execution code can throw if the keyspace is
doesn't exist, this can be a problem for code that will use
execute in a fiber since the exception will break the fiber even
if `then_wrapped` is used.
Fixes#14449
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closesscylladb/scylladb#14394
(cherry picked from commit c5956957f3)
The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column
of type "time" forgot to put the time string in quotes. The result was
invalid JSON. This is patch is a one-liner fixing this bug.
This patch also removes the "xfail" marker from one xfailing test
for this issue which now starts to pass. We also add a second test for
this issue - the existing test was for "SELECT TOJSON(t)", and the second
test shows that "SELECT JSON t" had exactly the same bug - and both are
fixed by the same patch.
We also had a test translated from Cassandra which exposed this bug,
but that test continues to fail because of other bugs, so we just
need to update the xfail string.
The patch also fixes one C++ test, test/boost/json_cql_query_test.cc,
which enshrined the *wrong* behavior - JSON output that isn't even
valid JSON - and had to be fixed. Unlike the Python tests, the C++ test
can't be run against Cassandra, and doesn't even run a JSON parser
on the output, which explains how it came to enshrine wrong output
instead of helping to discover the bug.
Fixes#7988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16121
(cherry picked from commit 8d040325ab)
since always we were putting cqlsh configuration into `~/.cqlshrc`
acording to commit from 8 years ago [1], this path is deprecated.
until this commit [2], actully remove this path from cqlsh code
as part of moving to scylla-cqlsh, we got [2], and didn't
notice until the first release with it.
this change write the configuration into `~/.casssndra/cqlshrc`
as this is the default place cqlsh is looking.
[1]: 13ea8a6669/bin/cqlsh.py (L264)
[2]: 2024ea4796Fixes: scylladb/scylladb#16329Closesscylladb/scylladb#16340
(cherry picked from commit 514ef48d75)
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.
This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.
In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.
As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.
This series fixes the bug and adds a minimal reproducer test.
Fixes#10598Closesscylladb/scylladb#16003
* github.com:scylladb/scylladb:
mutation_query_test: test that range tombstones are sent in reverse queries
mutation_query: properly send range tombstones in reverse queries
(cherry picked from commit 65e42e4166)
When scanning our latest docker image using `trivy` (command: `trivy
image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS
packages which are out of date.
Also removing `openssh-server` and `openssh-client` since we don't use
it for our docker images
Fixes: https://github.com/scylladb/scylladb/issues/16222Closesscylladb/scylladb#16224
(cherry picked from commit 7ce6962141)
Closesscylladb/scylladb#16359
This commit replaces the link to the OSS-only page
(the 5.2-to-5.4 upgrade guide not present in
the Enterprise docs) on the Raft page.
While providing the link to the specific upgrade
guide is more user-friendly, it causes build failures
of the Enterprise documentation. I've replaced
it with the link to the general Upgrade section.
The ".. only:: opensource" directive used to wrap
the OSS-only content correctly excludes the content
form the Enterprise docs - but it doesn't prevent
build warnings.
This commit must be backported to branch-5.4 to
prevent errors in all versions.
Closesscylladb/scylladb#16176
(cherry picked from commit 24d5dbd66f)
Fixes#16153
* jmx 166599f...f45067f (3):
> ColumnFamilyStore: only quote table names if necessary
> APIBuilder: allow quoted scope names
> ColumnFamilyStore: don't fail if there is a table with ":" in its name
* java dfbf3726ee...3764ae94db (1):
> NodeProbe: allow addressing table name with colon in it
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16294
This commit adds a short paragraph to the Raft
page to explain how to enable consistent
topology updates with Raft - an experimental
feature in version 5.4.
The paragraph should satisfy the requirements
for version 5.4. The Raft page will be
rewritten in the next release when consistent
topology changes with Raft will be GA.
Fixes https://github.com/scylladb/scylladb/issues/15080
Requires backport to branch-5.4.
Closesscylladb/scylladb#16273
(cherry picked from commit 409e20e5ab)
This commit fixes the rollback procedure in
the 4.6-to-5.0 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16155
(cherry picked from commit 1e80bdb440)
This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16154
(cherry picked from commit 7ad0b92559)
This commit fixes the rollback procedure in
the 5.1-to-5.2 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4 and branch-5.2.
Closesscylladb/scylladb#16152
(cherry picked from commit 91cddb606f)
Currently the cache updaters aren't exception safe
yet they are intended to be.
Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.
Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.
\Fixes scylladb/scylladb#15576
\Closes scylladb/scylladb#15577
* github.com:scylladb/scylladb:
row_cache: abort on exteral_updater::execute errors
row_cache: do_update: simplify _prev_snapshot_pos setup
(cherry picked from commit 4a0f16474f)
Closesscylladb/scylladb#16256
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.
The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.
The easy fix is to perform the only potentially-throwing step first.
Fixes#15822Closesscylladb/scylladb#15864
(cherry picked from commit 93ea3d41d8)
This commit adds information on how to enable
object storage for a keyspace.
The "Keyspace storage options" section already
existed in the doc, but it was not valid as
the support was only added in version 5.4
The scope of this commit:
- Update the "Keyspace storage options" section.
- Add the information about object storage support
to the Data Definition> CREATE KEYSPACE section
* Marked as "Experimental".
* Excluded from the Enterprise docs with the
.. only:: opensource directive.
This commit must be backported to branch-5.4,
as support for object storage was added
in version 5.4.
Closesscylladb/scylladb#16081
(cherry picked from commit bfe19c0ed2)
Update node_exporter to 1.7.0.
The previous version (1.6.1) was flagged by security scanners (such as
Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that
problem.
[Botond: regenerate frozen toolchain]
Fixes#16085Closesscylladb/scylladb#16086Closesscylladb/scylladb#16090
(cherry picked from commit 321459ec51)
[avi: regenerate frozen toolchain]
* ./tools/jmx 8d15342e...9a03d4fa (1):
> Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski
* ./tools/java 3c09ab97...dfbf3726 (1):
> Merge 'build: update several dependencies' from Piotr Grabowski
Update build dependencies which were flagged by security scanners.
Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351Closesscylladb/scylladb#16149
This commit fixes the rollback procedure in
the 5.2-to-5.4 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've removed the optional step to enable
consistent schema management from the list of
steps - the appropriate section has already
been removed, but it remained in the procedure
description, which was misleading.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4
Closesscylladb/scylladb#16114
(cherry picked from commit 3751acce42)
This miniset addresses two potential conversions to `global_schema_ptr` of incomplete materialized views schemas.
One of them was completely unnecessary and also is a "chicken and an egg" problem where on the sync schema procedure itself a view schema was converted to `global_schema_ptr` solely for the purposes of logging. This can create a
"hickup" in the materialized views updates if they are comming from a node with a different mv schema.
The reason why sometimes a synced schema can have no base info is because of deactivision and reactivision of the schema inside the `schema_registry` which doesn't restore the base information due to lack of context.
When a schema is synced the problem becomes easy since we can just use the latest base information from the database.
Fixes#14011Closesscylladb/scylladb#14861
* github.com:scylladb/scylladb:
migration manager: fix incomplete mv schemas returned from get_schema_for_write
migration_manager: do not globalize potentially incomplete schema
(cherry picked from commit 5752dc875b)
allow_mutation_read_page_without_live_row is a new option in the
partition_slice::option option set. In a mixed clusters, old nodes
possibly don't know this new option, so its usage must be protected by a
cluster feature. This patch does just that.
Fixes: #15795Closesscylladb/scylladb#15890
(cherry picked from commit f53961248d)
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.
To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.
Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.
I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.
Fixes: #15485Closesscylladb/scylladb#16019
(cherry picked from commit dfd7981fa7)
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".
Fixes#16040Closesscylladb/scylladb#16041
(cherry picked from commit 338a9492c9)
There are two tests, test_read_all and test_read_with_partition_row_limits, which asserts on every page as well
as at the end that there are no misses whatsoever. This is incorrect, because it is possible that on a given page, not all shards participate and thus there won't be a saved reader on every shard. On the subsequent page, a shard without a reader may produce a miss. This is fine. Refine the asserts, to check that we have only as much misses, as many
shards we have without readers on them.
Fixes: https://github.com/scylladb/scylladb/issues/14087Closesscylladb/scylladb#15806
* github.com:scylladb/scylladb:
test/boost/multishard_mutation_query_test: fix querier cache misses expectations
test/lib/test_utils: add require_* variants for all comparators
(cherry picked from commit 457d170078)
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.
This fix affects also alternator's secondary indexes.
Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844Closesscylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
This commit adds the .. only:: opensource directive
to the Raft page to exclude the link to the 5.2-to-5.4
upgrade guide from the Enterprise documentation.
The Raft page belongs to both OSS and Enterprise
documentation sets, while the upgrade guide
is OSS-only. This causes documentation build
issues in the Enterprise repository, for example,
https://github.com/scylladb/scylla-enterprise/pull/3242.
As a rule, all OSS-only links should be provided
by using the .. only:: opensource directive.
This commit must be backported to branch-5.4
to prevent errors in the documentation for
ScyllaDB Enterprise 2024.1
(backport)
Closesscylladb/scylladb#16064
(cherry picked from commit ca22de4843)
Currently, when said feature is enabled, we recalcuate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalulation. However this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidently, by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.
Fixes: #16004Closesscylladb/scylladb#16013
(cherry picked from commit 22381441b0)
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).
This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.
This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.
Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.
In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.
Fixes: scylladb/scylladb#15622Closesscylladb/scylladb#15972
(cherry picked from commit f094e23d84)
When topology coordinator tries to fence the previous coordinator it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which will result in a `raft::request_aborted`
exception being thrown. After the fix to scylladb/scylladb#15728 was
merged, the exception is caught, but then `sleep_abortable` is called
which immediately throws `abort_requested_exception` as it uses the same
abort source as the group0 operation. The `fence_previous_coordinator`
function which does all those things is not supposed to throw
exceptions, if it does - it causes `raft_state_monitor_fiber` to exit,
completely disabling the topology coordinator functionality on that
node.
Modify the code in the following way:
- Catch `abort_requested_exception` thrown from `sleep_abortable` and
exit the function if it happens. In addition to the described issue,
it will also handle the case when abort is requested while
`sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the
exception with lower verbosity and exit the function explicitly.
Finally, wrap both `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.
Fixes: scylladb/scylladb#15747Closesscylladb/scylladb#15948
* github.com:scylladb/scylladb:
raft topology: catch and abort on exceptions from topology_coordinator::run
Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
raft topology: don't print an error when fencing previous coordinator is aborted
raft topology: handle abort exceptions from sleeping in fence_previous_coordinator
(cherry picked from commit 07e9522d6c)
This commit updates the Repair-Based Node
Operations page. In particular:
- Information about RBNO enabled for all
node operations is added (before 5.4, RBNO
was enabled for the replace operation, while
it was experimental for others).
- The content is rewritten to remove redundant
information about previous versions.
The improvement is part of the 5.4 release.
This commit must be backported to branch-5.4
Closesscylladb/scylladb#16015
(cherry picked from commit 8a4a8f077a)
since we use `sphinx_multiversion` for building multiple versions of document. and in #15860, we changed the way how options are rendered, so the same change should be applied to the branch which includes the option list.
to address the conflicts, in addition to #15860, the depended PRs are also backported. so, in this pull request, following PRs are backported:
- #15827
- #15782
- #15860Closesscylladb/scylladb#16030
* github.com:scylladb/scylladb:
docs: add divider using CSS
docs: extract _clean_description as a filter
docs: render option with role
docs: update cofig params design
docs: parse source files right into rst
instead of hardwiring the formatting in the html code, do this using
CSS, more flexible this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ff12f1f678)
would be better to split the parser from the formatter. in future,
we can apply more filter on top of the exiting one.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1694a7addc)
so we can cross-reference them with the syntax like
:confval:`alternator_timeout_in_ms`.
or even render an option like:
.. confval:: alternator_timeout_in_ms
in order to make the headerlink of the option visible,
a new CSS rule is added.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9ddc639237)
This commit fixes the information about
Raft-based consistent cluster management
in the 5.2-to-5.4 upgrade guide.
This a follow-up to https://github.com/scylladb/scylladb/pull/15880 and must be backported to branch-5.4.
In addition, it adds information about removing
DateTieredCompactionStrategy to the 5.2-to-5.4
upgrade guide, including the guideline to
migrate to TimeWindowCompactionStrategy.
Closesscylladb/scylladb#15988
(cherry picked from commit ca0f5f39b5)
This commit updates the cqlsh compatibility
with Python to Python 3.
In addition it:
- Replaces "Cassandra" with "ScyllaDB" in
the description of cqlsh.
The previous description was outdated, as
we no longer can talk about using cqlsh
released with Cassandra.
- Replaces occurrences of "Scylla" with "ScyllaDB".
- Adds additional locations of cqlsh (Docker Hub
and PyPI), as well as the link to the scylla-cqlsh
repository.
Closesscylladb/scylladb#16016
(cherry picked from commit 8d618bbfc6)
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.
Fixes: #14330Closesscylladb/scylladb#15879
(cherry picked from commit 90a8489809)
This PR adds the 5.2-5.4 upgrade guide.
In addition, it removes the redundant upgrade guide from 5.2 to 5.3 (as 5.3 was skipped), as well as some mentions of version 5.3.
This PR must be backported to branch-5.4.
Closesscylladb/scylladb#15880
* github.com:scylladb/scylladb:
doc: add the upgrade guide from 5.2 to 5.4
doc: remove version "5.3" from the docs
doc: remove the 5.2-to-5.3 upgrade guide
(cherry picked from commit 74f68a472f)
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be easily and
without risk be backported to older versions; this is the fix.
Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.
Fixes: scylladb/scylladb#15816Closesscylladb/scylladb#15970
* github.com:scylladb/scylladb:
test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
api: failure_detector: fix indentation
api: failure_detector: invoke on shard 0
(cherry picked from commit 9443253f3d)
This commit updates the package installation
instructions in version 5.4.
- It updates the variables to include "5.4"
as the version name.
- It adds the information for the newly supported
Rocky/RHEL 9 - a new EPEL download link is required.
Closesscylladb/scylladb#15963
(cherry picked from commit 1e0cbfe522)
This commit adds OS support information
in version 5.4 (removing the non-released
version 5.3).
In particular, it adds support for Oracle Linux
and Amazon Linux.
Also, it removes support for outdated versions.
Closesscylladb/scylladb#15923
(cherry picked from commit 3756705520)
Topology on raft is still an experimental feature. The RPC verbs
introduced in that mode shouldn't be used when it's disabled, otherwise
we lose the right to make breaking changes to those verbs.
First, make sure that the aforementioned verbs are not sent outside the
mode. It turns out that `raft_pull_topology_snapshot` could be sent
outside topology-on-raft mode - after the PR, it no longer can.
Second, topology-on-raft mode verbs are now not registered at all on the
receiving side when the mode is disabled.
Additionally tested by running `topology/` tests with
`consistent_cluster_management: True` but with experimental features
disabled.
Fixes: scylladb/scylladb#15862Closesscylladb/scylladb#15917
* github.com:scylladb/scylladb:
storage_service: fix indentation
raft: topology: only register verbs in topology-on-raft mode
raft: topology: only pull topology snapshot in topology-on-raft mode
(cherry picked from commit 5cf18b18b2)
Major compaction semantics is that all data of a table will be compacted
together, so user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that all data in maintenance set won't be included
for major, until they're promoted into main set by off-strategy.
So user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in maintenance
set will be made available by major. A similar approach is done for
data in memtable, so flush is performed before major starts.
The only exception will be data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization in comapaction manager of reshape jobs guarantee
correctness if there's an ongoing off-strategy on behalf of the
table.
Fixes#11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15792
(cherry picked from commit ea6c281b9f)
Said statement keeps a reference to erm indirectly, via a topology node pointer, but doesn't keep erm alive. This can result in use-after-free. Furthermore, it allows for vnodes being pulled from under the query's feet, as it is running.
To prevent this, keep the erm alive for the duration of the query.
Also, use `host_id` instead of `node`, the node pointer is not needed really, as the statement only uses the host id from it.
Fixes: #15802Closesscylladb/scylladb#15808
* github.com:scylladb/scylladb:
cql3: mutation_fragments_select_statement: use host_id instead of node
cql3: mutation_fragments_select_statement: pin erm reference
(cherry picked from commit 782c6a208a)
Throwing error kills the topology coordinator monitor fiber. Instead we
retry the operation until it succeeds or the node looses its leadership.
This is fine before for the operation to succeed quorum is needed and if
the quorum is not available the node should relinquish its leadership.
Fixes#15728
(cherry picked from commit 65bf5877e7)
This commit fixes the layout of the Reference
page. Previously, the toctree level was "2",
which made the page hard to navigate.
This PR changes the level to "1".
In addition, the capitalization of page
titles is fixed.
This is a follow-up PR to the ones that
created and updated the Reference section.
It must be backported to branch-5.4.
Closesscylladb/scylladb#15830
(cherry picked from commit e223624e2e)
When populating system keyspace the sstable_directory forgets to create upload/ subdir in the tables' datadir because of the way it's invoked from distributed loader. For non-system keyspaces directories are created in table::init_storage() which is self-contained and just creates the whole layout regardless of what.
This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.
Test included.
fixes: #15708closes: scylladb/scylla-manager#3603Closesscylladb/scylladb#15723
* github.com:scylladb/scylladb:
test: Add test for datadir/ layout
sstable_directory: Indentation fix after previous patch
db,sstables: Move storage init for system keyspace to table creation
(cherry picked from commit 7f81957437)
This commit:
- Removes upgrade guides for versions older than 5.0.
The oldest one is from version 4.6 to 5.0.
- Adds the redirections for the removed pages.
Closesscylladb/scylladb#15709
"description":"Controls flushing of memtables before compaction (true by default). Set to \"false\" to skip automatic flushing of memtables before compaction, e.g. when the table is flushed explicitly before invoking the compaction api.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"summary":"Forces major compaction in all keyspaces",
"type":"void",
"nickname":"force_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"flush_memtables",
"description":"Controls flushing of memtables before compaction (true by default). Set to \"false\" to skip automatic flushing of memtables before compaction, e.g. when tables were flushed explicitly before invoking the compaction api.",
"description":"Controls flushing of memtables before compaction (true by default). Set to \"false\" to skip automatic flushing of memtables before compaction, e.g. when tables were flushed explicitly before invoking the compaction api.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -912,6 +944,21 @@
}
]
},
{
"path":"/storage_service/flush",
"operations":[
{
"method":"POST",
"summary":"Flush all memtables in all keyspaces.",
throwexceptions::invalid_request_exception(format("Cannot add new field to type {} because it is used in the clustering key column {} of table {}.{} where durations are not allowed",
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity")
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold")
"Adjusts the sensitivity of the failure detector on an exponential scale. Generally this setting never needs adjusting.\n"
"Related information: Failure detection and recovery")
,failure_detector_timeout_in_ms(this,"failure_detector_timeout_in_ms",liveness::LiveUpdate,value_status::Used,20*1000,"Maximum time between two successful echo message before gossip mark a node down in milliseconds.\n")
,direct_failure_detector_ping_timeout_in_ms(this,"direct_failure_detector_ping_timeout_in_ms",value_status::Used,600,"Duration after which the direct failure detector aborts a ping message, so the next ping can start.\n"
"Note: this failure detector is used by Raft, and is different from gossiper's failure detector (configured by `failure_detector_timeout_in_ms`).\n")
,enable_repair_based_node_ops(this,"enable_repair_based_node_ops",liveness::LiveUpdate,value_status::Used,true,"Set true to use enable repair based node operations instead of streaming based")
,allowed_repair_based_node_ops(this,"allowed_repair_based_node_ops",liveness::LiveUpdate,value_status::Used,"replace,removenode,rebuild,bootstrap,decommission","A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild")
,enable_compacting_data_for_streaming_and_repair(this,"enable_compacting_data_for_streaming_and_repair",liveness::LiveUpdate,value_status::Used,true,"Enable the compacting reader, which compacts the data for streaming and repair (load'n'stream included) before sending it to, or synchronizing it with peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead.")
"Specify the fraction of partitions written by repair out of the total partitions. The value is currently only used for bloom filter estimation. Value is between 0 and 1.")
,ring_delay_ms(this,"ring_delay_ms",value_status::Used,30*1000,"Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.")
,shadow_round_ms(this,"shadow_round_ms",value_status::Used,300*1000,"The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.")
,fd_max_interval_ms(this,"fd_max_interval_ms",value_status::Used,2*1000,"The maximum failure_detector interval time in milliseconds. Interval larger than the maximum will be ignored. Larger cluster may need to increase the default.")
,unspooled_dirty_soft_limit(this,"unspooled_dirty_soft_limit",value_status::Used,0.6,"Soft limit of unspooled dirty memory expressed as a portion of the hard limit")
,sstable_summary_ratio(this,"sstable_summary_ratio",value_status::Used,0.0005,"Enforces that 1 byte of summary is written for every N (2000 by default) "
"bytes written to data file. Value must be between 0 and 1.")
,components_memory_reclaim_threshold(this,"components_memory_reclaim_threshold",liveness::LiveUpdate,value_status::Used,.2,"Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
,large_memory_allocation_warning_threshold(this,"large_memory_allocation_warning_threshold",value_status::Used,size_t(1)<<20,"Warn about memory allocations above this size; set to zero to disable")
,enable_deprecated_partitioners(this,"enable_deprecated_partitioners",value_status::Used,false,"Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
,enable_keyspace_column_family_metrics(this,"enable_keyspace_column_family_metrics",value_status::Used,false,"Enable per keyspace and per column family metrics reporting")
@@ -8,8 +8,7 @@ Scylla implements the following compaction strategies in order to reduce :term:`
*`Size-tiered compaction strategy (STCS)`_ - triggered when the system has enough (four by default) similarly sized SSTables.
*`Leveled compaction strategy (LCS)`_ - the system uses small, fixed-size (by default 160 MB) SSTables distributed across different levels.
*`Incremental Compaction Strategy (ICS)`_ - shares the same read and write amplification factors as STCS, but it fixes its 2x temporary space amplification issue by breaking huge sstables into SSTable runs, which are comprised of a sorted set of smaller (1 GB by default), non-overlapping SSTables.
*`Time-window compaction strategy (TWCS)`_ - designed for time series data; replaced Date-tiered compaction.
*`Date-tiered compaction strategy (DTCS)`_ - designed for time series data.
*`Time-window compaction strategy (TWCS)`_ - designed for time series data.
This document covers how to choose a compaction strategy and presents the benefits and disadvantages of each one. If you want more information on compaction in general or on any of these strategies, refer to the :doc:`Compaction Overview </kb/compaction>`. If you want an explanation of the CQL commands used to create a compaction strategy, refer to :doc:`Compaction CQL Reference </cql/compaction>` .
@@ -78,7 +77,6 @@ ICS is only available in ScyllaDB Enterprise. See the `ScyllaDB Enetrpise docume
Time-window Compaction Strategy (TWCS)
======================================
Time-window compaction strategy was introduced in Cassandra 3.0.8 for time-series data as a replacement for `Date-tiered Compaction Strategy (DTCS)`_.
Time-Window Compaction Strategy compacts SSTables within each time window using `Size-tiered Compaction Strategy (STCS)`_.
SSTables from different time windows are never compacted together. You set the :ref:`TimeWindowCompactionStrategy <time-window-compactionstrategy-twcs>` parameters when you create a table using a CQL command.
@@ -87,9 +85,8 @@ SSTables from different time windows are never compacted together. You set the :
Time-window Compaction benefits
-------------------------------
* Keeps entries according to a time range, making searches for data within a given range easy to do, resulting in better read performance
*Improves over DTCS in that it reduces the number to huge compactions
* Allows you to expire an entire SSTable at once (using a TTL) as the data is already organized within a time frame
* Keeps entries according to a time range, making searches for data within a given range easy to do, resulting in better read performance.
*Allows you to expire an entire SSTable at once (using a TTL) as the data is already organized within a time frame.
Time-window Compaction deficits
-------------------------------
@@ -102,14 +99,6 @@ Set the parameters for :ref:`Time-window Compaction <time-window-compactionstrat
Use the table in `Which strategy is best`_ to determine if this is the right strategy for your needs.
.._DTCS1:
Date-tiered Compaction Strategy (DTCS)
======================================
Date-Tiered Compaction is designed for time series data. This strategy was introduced with Cassandra 2.1.
It is only suitable for time-series data. This strategy is not recommended and has been replaced by :ref:`Time-window Compaction Strategy <TWCS1>`.
In ScyllaDB 5.2 and ScyllaDB Enterprise 2023.1 Raft is Generally Available and can be safely used for consistent schema management.
In further versions, it will be mandatory.
It will get enabled by default when you upgrade your cluster to ScyllaDB 5.4 or 2024.1.
If needed, you can explicitly prevent it from getting enabled upon upgrade.
..only:: opensource
See :doc:`the upgrade guide from 5.2 to 5.4 </upgrade/index>` for details.
ScyllaDB Open Source 5.2 and later, and ScyllaDB Enterprise 2023.1 and later come equipped with a procedure that can setup Raft-based consistent cluster management in an existing cluster. We refer to this as the **Raft upgrade procedure** (do not confuse with the :doc:`ScyllaDB version upgrade procedure </upgrade/index/>`).
@@ -214,6 +219,36 @@ of nodes in the cluster is available. The following examples illustrate how Raft
In summary, Raft makes schema changes safe, but it requires that a quorum of nodes in the cluster is available.
.._raft-topology-changes:
..only:: opensource
Consistent Topology with Raft :label-caution:`Experimental`
@@ -19,8 +19,6 @@ The following compaction strategies are supported by Scylla:
* Time-window Compaction Strategy (`TWCS`_)
* Date-tiered Compaction Strategy (DTCS) - use `TWCS`_ instead
This page concentrates on the parameters to use when creating a table with a compaction strategy. If you are unsure which strategy to use or want general information on the compaction strategies which are available to Scylla, refer to :doc:`Compaction Strategies </architecture/compaction/compaction-strategies>`.
Custom strategy can be provided by specifying the full class name as a :ref:`string constant
<constants>`.
All default strategies support a number of common options, as well as options specific to
the strategy chosen (see the section corresponding to your strategy for details: :ref:`STCS <stcs-options>`, :ref:`LCS <lcs-options>`, and :ref:`TWCS <twcs-options>`). DTCS is not recommended, and TWCS should be used instead.
.. ``'Date Tiered Compaction Strategy is not recommended and has been replaced by Time Window Compaction Stragegy.'`` (:ref:`TWCS <TWCS>`) (the
.. is also supported but is deprecated and ``'TimeWindowCompactionStrategy'`` should be
.. preferred instead).
the strategy chosen (see the section corresponding to your strategy for details: :ref:`STCS <stcs-options>`, :ref:`LCS <lcs-options>`, and :ref:`TWCS <twcs-options>`).
@@ -5,7 +5,7 @@ Wasm support for user-defined functions
This document describes the details of Wasm language support in user-defined functions (UDF). The language ``wasm`` is one of the possible languages to use, besides Lua, to implement these functions. To learn more about User-defined functions in ScyllaDB, click :ref:`here <udfs>`.
..note:: Until ScyllaDB 5.2, the Wasm language was called ``xwasm``. This name is replaced with ``wasm`` in ScyllaDB 5.3.
..note:: Until ScyllaDB 5.2, the Wasm language was called ``xwasm``. This name is replaced with ``wasm`` in ScyllaDB 5.4.
@@ -198,11 +198,27 @@ We're not able to prevent a node learning about a new generation too late due to
After committing the generation ID, the topology coordinator publishes the generation data to user-facing description tables (`system_distributed.cdc_streams_descriptions_v2` and `system_distributed.cdc_generation_timestamps`).
#### Generation switching: other notes
#### Generation switching: accepting writes
Due to the need of maintaining colocation we don't allow the client to send writes with arbitrary timestamps.
Suppose that a write is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`). Then we only allow the write if its timestamp is in the interval [`T`, `C + generation_leeway`), where `generation_leeway` is a small time-inteval constant (e.g. 5 seconds).
Reason: we cannot allow writes before `T`, because they belong to the old generation whose token ranges might no longer refine the current vnodes, so the corresponding log write would not necessarily be colocated with the base write. We also cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
Due to the need of maintaining colocation we don't allow the client to send writes with arbitrary timestamps. We allow:
- writes to the current and next generations unless they are too far into the future,
- writes to the previous generations unless they are too far into the past.
##### Writes to the current and next generations
Suppose that a write with timestamp `W` is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`) such that `T <= W`. Then we only allow the write if `W < C + generation_leeway`, where `generation_leeway` is a small time-interval constant (e.g. 5 seconds).
We cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
##### Writes to the previous generations
This time suppose that `T > W`. Then we only allow the write if `W > C - generation_leeway` and there was a generation operating at `W`.
We allow writes to previous generations to improve user experience. If a client generates timestamps by itself and clocks are not perfectly synchronized, there may be short periods of time around the moment of switching generations when client's writes are rejected because they fall into one of the previous generations. Usually, this problem is easy to overcome by the client. It can simply repeat a write a few times, but using a higher timestamp. Unfortunately, if a table additionally uses LWT, the client cannot increase the timestamp because LWT makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves doing a CDC log table write. If it fails, we are stuck. Allowing writes to the previous generations is also a probabilistic fix for this bug.
Note that writing only to the previous generation might not be enough. With the Raft-based topology and tablets, we can add multiple nodes almost instantly. Then, we can have multiple generations with almost identical timestamps.
We allow writes only to the recent past to reduce the number of generations that must be stored in memory.
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
@@ -8,8 +8,14 @@ as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com
Launching Instances from ScyllaDB AMI
---------------------------------------
#.Go to `Amazon EC2 AMIs – ScyllaDB <https://www.scylladb.com/download/?platform=aws#open-source>`_ in ScyllaDB's download center,
choose your region, and click the **Node** link to open the EC2 instance creation wizard.
#.Choose your region, and click the **Node** link to open the EC2 instance creation wizard.
The following table shows the latest patch release. See :doc:`AWS Images </reference/aws-images/>` for earlier releases.
..scylladb_aws_images_template::
:exclude:rc,dev
:only_latest:
#. Choose the instance type. See :ref:`Cloud Instance Recommendations for AWS <system-requirements-aws>` for the list of recommended instances.
Other instance types will work, but with lesser performance. If you choose an instance type other than the recommended ones, make sure to run the :ref:`scylla_setup <system-configuration-scripts>` script.
@@ -5,92 +5,9 @@ The following matrix shows which Linux distributions, containers, and images are
Where *supported* in this scope means:
.. REMOVE IN FUTURE VERSIONS - Remove information about versions from the notes below in version 5.2.
- A binary installation package is available to `download <https://www.scylladb.com/download/>`_.
- The download and install procedures are tested as part of ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for latest versions)
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
..scylladb_include_flag:: os-support-info.rst
..note::
**Supported Architecture**
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build. In particular, aarch64 support includes AWS EC2 Graviton.
ScyllaDB Open Source
----------------------
..note::
The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
@@ -23,7 +23,7 @@ It’s recommended to have a balanced setup. If there are only 4-8 :term:`Logica
This works in the opposite direction as well.
ScyllaDB can be used in many types of installation environments.
To see which system would best suit your workload requirements, use the `ScyllaDB Sizing Calculator <https://price-calc.gh.scylladb.com/>`_ to customize ScyllaDB for your usage.
To see which system would best suit your workload requirements, use the `ScyllaDB Sizing Calculator <https://www.scylladb.com/product/scylla-cloud/get-pricing/>`_ to customize ScyllaDB for your usage.
@@ -44,8 +44,7 @@ A compaction strategy is what determines which of the SSTables will be compacted
*`Size-tiered compaction strategy (STCS)`_ - (default setting) triggered when the system has enough similarly sized SSTables.
*`Leveled compaction strategy (LCS)`_ - the system uses small, fixed-size (by default 160 MB) SSTables divided into different levels and lowers both Read and Space Amplification.
*:ref:`Incremental compaction strategy (ICS) <incremental-compaction-strategy-ics>` - :label-tip:`ScyllaDB Enterprise` Uses runs of sorted, fixed size (by default 1 GB) SSTables in a similar way that LCS does, organized into size-tiers, similar to STCS size-tiers. If you are an Enterprise customer ICS is an updated strategy meant to replace STCS. It has the same read and write amplification, but has lower space amplification due to the reduction of temporary space overhead is reduced to a constant manageable level.
*`Time-window compaction strategy (TWCS)`_ - designed for time series data and puts data in time order. This strategy replaced Date-tiered compaction. TWCS uses STCS to prevent accumulating SSTables in a window not yet closed. When the window closes, TWCS works towards reducing the SSTables in a time window to one.
*`Date-tiered compaction strategy (DTCS)`_ - designed for time series data, but TWCS should be used instead.
*`Time-window compaction strategy (TWCS)`_ - designed for time series data and puts data in time order. TWCS uses STCS to prevent accumulating SSTables in a window not yet closed. When the window closes, TWCS works towards reducing the SSTables in a time window to one.
How to Set a Compaction Strategy
................................
@@ -125,7 +124,7 @@ ICS is only available in ScyllaDB Enterprise. See the `ScyllaDB Enetrpise docume
Time-window Compaction Strategy (TWCS)
--------------------------------------
Time-window compaction strategy was introduced as a replacement for `Date-tiered Compaction Strategy (DTCS)`_ for handling time series workloads. Time-Window Compaction Strategy compacts SSTables within each time window using `Size-tiered Compaction Strategy (STCS)`_. SSTables from different time windows are never compacted together.
Time-Window Compaction Strategy is designed for handling time series workloads. It compacts SSTables within each time window using `Size-tiered Compaction Strategy (STCS)`_. SSTables from different time windows are never compacted together.
..include:: /rst_include/warning-ttl-twcs.rst
@@ -148,22 +147,6 @@ The primary motivation for TWCS is to separate data on disk by timestamp and to
While TWCS tries to minimize the impact of commingled data, users should attempt to avoid this behavior. Specifically, users should avoid queries that explicitly set the timestamp. It is recommended to run frequent repairs (which streams data in such a way that it does not become commingled), and disable background read repair by setting the table’s :ref:`read_repair_chance <create-table-general-options>` and :ref:`dclocal_read_repair_chance <create-table-general-options>` to ``0``.
Date-tiered compaction strategy (DTCS)
--------------------------------------
Date-Tiered Compaction is designed for time series data. It is only suitable for time-series data. This strategy has been replaced by `Time-window compaction strategy (TWCS)`_.
Date-tiered compaction strategy works as follows:
* First it sorts the SSTables by time and then compacts adjacent (time-wise) SSTables.
* This results in SSTables whose sizes increase exponentially as they grow older.
For example, at some point we can have the last minute of data in one SSTable (by default, base_time_seconds = 60), another minute before that in another SSTable, then the 4 minutes before that in one SSTable, then the 4 minutes before that, then an SSTable of the 16 minutes before that, and so on. This structure can easily be maintained by compaction, very similar to size-tiered compaction. When there are 4 (the default value for min_threshold) small (one-minute) consecutive SSTables, they are compacted into one 4-minute SSTable. When there are 4 of the bigger SSTables one after another (time-wise), they are merged into a 16-minute SSTable, and so on.
Antique SSTables older than ``max_SSTable_age_days`` (by default 365 days) are not compacted as doing these compactions would not be useful for most queries, the process would be very slow, and the compaction would require huge amounts of temporary disk space.
Scylla has two use cases for transferring data between nodes:
In ScyllaDB, data is transferred between nodes during:
- Topology changes, like adding and removing nodes.
- Repair, a background process to compare and sync data between nodes.
* Topology changes via node operations, such as adding or removing nodes.
* Repair - a row-level background process to compare and sync data between nodes.
Up to Scylla 4.6, the two used different underline logic. In later releases, the same data transferring logic used for repair is also used for topology changes, making it more robust, reliable, and safer for data consistency. In particular, node operations can restart from the same point it stopped without sending data that has been synced, a significant time-saver when adding or removing large nodes.
In 4.6, Repair Based Node Operations (RBNO) is enabled by default only for replace node operation.
Example from scylla.yaml:
By default, the row-level repair mechanism used for the repair process is also
used during node operations (instead of streaming). We refer to it as
Repair-Based Node Operations (RBNO).
..code::yaml
enable_repair_based_node_ops:true
allowed_repair_based_node_ops:replace
RBNO is more robust, reliable, and safer for data consistency than streaming.
In particular, a failed node operation can resume from the point it stopped -
without sending data that has already been synced, which is a significant
time-saver when adding or removing large nodes. In addition, with RBNO enabled,
you don't need to run repair befor or after node operations, such as replace
or removenode.
To enable other operations (experimental), add them as a comma-separated list to allowed_repair_based_node_ops. Available operations are:
RBNO is enabled for the following node operations:
* bootstrap
* replace
* removenode
* decommission
* rebuild
* removenode
* replace
The following configuration options can be used to enable or disable RBNO:
*``enable_repair_based_node_ops= true|false`` - Enables or disables RBNO.
Quorum is a *global* consistency level setting across the entire cluster including all data centers. See :doc:`Consistency Levels </cql/consistency>`.
Date-tiered compaction strategy (DTCS)
:abbr:`DTCS (Date-tiered compaction strategy)` is designed for time series data, but should not be used. Use :term:`Time-Window Compaction Strategy`. See :doc:`Compaction Strategies</architecture/compaction/compaction-strategies/>`.
Entropy
A state where data is not consistent. This is the result when replicas are not synced and data is random. Scylla has measures in place to be antientropic. See :doc:`Scylla Anti-Entropy </architecture/anti-entropy/index>`.
@@ -151,7 +148,7 @@ Glossary
A collection of columns fetched by row. Columns are ordered by Clustering Key. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
Time-window compaction strategy
TWCS is designed for time series data and replaced Date-tiered compaction. See :doc:`Compaction Strategies</architecture/compaction/compaction-strategies/>`.
TWCS is designed for time series data. See :doc:`Compaction Strategies</architecture/compaction/compaction-strategies/>`.
Token
A value in a range, used to identify both nodes and partitions. Each node in a Scylla cluster is given an (initial) token, which defines the end of the range a node handles. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
@@ -12,7 +12,7 @@ the ``/etc/systemd/system/var-lib-scylla.mount`` and ``/etc/systemd/system/var-l
deleted by RPM.
To avoid losing the files, the upgrade procedure includes a step to backup the .mount files. The following
example shows the command to backup the files before the :doc:`upgrade from version 5.0 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/upgrade-guide-from-5.0-to-2022.1-rpm/>`:
example shows the command to backup the files before the upgrade from version 5.0:
This document is a step by step procedure for upgrading from Scylla 1.6 to Scylla 1.7, and rollback to 1.6 if required.
Applicable versions
===================
This guide covers upgrading Scylla from the following versions: 1.6.x to Scylla version 1.7.y on the following platform:
* |OS|
Upgrade Procedure
=================
..include:: /upgrade/_common/warning.rst
A Scylla upgrade is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes in the cluster, you will:
* drain node and backup the data
* check your current release
* backup configuration file
* stop Scylla
* download and install new Scylla packages
* start Scylla
* validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
**During** the rolling upgrade it is highly recommended:
* Not to use new 1.7 features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes
* Not to apply schema changes
Upgrade steps
=============
Drain node and backup the data
------------------------------
Before any major procedure, like an upgrade, it is recommended to backup all the data to an external device. In Scylla, backup is done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following command:
..code::sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
Backup configuration file
-------------------------
..code::sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-1.6
Gracefully stop the node
------------------------
..code::sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. You should use the same version in case you want to |ROLLBACK|_ the upgrade. If you are not running a 1.6.x version, stop right here! This guide only covers 1.6.x to 1.7.y upgrades.
To upgrade:
1. Update the |APT|_ to **1.7**
2. Upgrade java to 1.8 on Ubuntu 14.04 and Debian 8, which is requested by Scylla 1.7
1. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
2. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check scylla version.
3. Check scylla-server log (check ``/var/log/upstart/scylla-server.log`` for Ubuntu 14.04, execute ``journalctl _COMM=scylla`` for Ubuntu 16.04 and Debian 8) and ``/var/log/syslog`` to validate there are no errors.
4. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
Rollback Procedure
==================
..include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from Scylla release 1.7.x to 1.6.y. Apply this procedure if an upgrade from 1.6 to 1.7 failed before completing on all nodes. Use this procedure only for nodes you upgraded to 1.7
Scylla rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to 1.6, you will:
* drain the node and stop Scylla
* retrieve the old Scylla packages
* restore the configuration file
* restart Scylla
* validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
Rollback steps
==============
Gracefully shutdown Scylla
--------------------------
..code::sh
nodetool drain
sudo service scylla-server stop
download and install the old release
------------------------------------
1. Remove the old repo file.
..code::sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
2. Update the |APT|_ to **1.6**
3. install
..code::sh
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer ‘y’ to the first two questions.
Restore the configuration file
------------------------------
..code::sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-1.6 /etc/scylla/scylla.yaml
Start the node
--------------
..code::sh
sudo service scylla-server start
Validate
--------
Check upgrade instruction above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.
This document is a step by step procedure for upgrading from Scylla 1.7 to Scylla 2.0, and rollback to 1.7 if required.
Applicable versions
===================
This guide covers upgrading Scylla from the following versions: 1.7.x (x >= 4) to Scylla version 2.0.y on the following platform:
* |OS|
Upgrade Procedure
=================
..include:: /upgrade/_common/warning.rst
A Scylla upgrade is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes in the cluster, you will:
* check cluster schema
* drain node and backup the data
* backup configuration file
* stop Scylla
* download and install new Scylla packages
* start Scylla
* validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
**During** the rolling upgrade it is highly recommended:
* Not to use new 2.0 features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes
* Not to apply schema changes
Upgrade steps
=============
Check cluster schema
--------------------
Make sure that all nodes have the schema synched prior to upgrade, we won't survive an upgrade that has schema disagreement between nodes.
..code::sh
nodetool describecluster
Drain node and backup the data
------------------------------
Before any major procedure, like an upgrade, it is recommended to backup all the data to an external device. In Scylla, backup is done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following command:
..code::sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
Backup configuration file
-------------------------
..code::sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-1.7
Gracefully stop the node
------------------------
..code::sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. You should use the same version in case you want to |ROLLBACK|_ the upgrade. If you are not running a 1.7.x version, stop right here! This guide only covers 1.7.x to 2.0.y upgrades.
To upgrade:
1. Update the |APT|_ to **2.0**
2. Upgrade java to 1.8 on Ubuntu 14.04 and Debian 8, which is requested by Scylla 2.0
1. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
2. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check scylla version.
3. Check scylla-server log (check ``/var/log/upstart/scylla-server.log`` for Ubuntu 14.04, execute ``journalctl _COMM=scylla`` for Ubuntu 16.04 and Debian 8) and ``/var/log/syslog`` to validate there are no errors.
4. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
* More on :doc:`Scylla Metrics Update - Scylla 1.7 to 2.0<metric-update-1.7-to-2.0>`
Rollback Procedure
==================
..include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from Scylla release 2.0.x to 1.7.y (y >= 4). Apply this procedure if an upgrade from 1.7 to 2.0 failed before completing on all nodes. Use this procedure only for nodes you upgraded to 2.0
Scylla rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to 1.7, you will:
* drain the node and stop Scylla
* retrieve the old Scylla packages
* restore the configuration file
* restart Scylla
* validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
Rollback steps
==============
Gracefully shutdown Scylla
--------------------------
..code::sh
nodetool drain
sudo service scylla-server stop
download and install the old release
------------------------------------
1. Remove the old repo file.
..code::sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
2. Update the |APT|_ to **1.7**
3. install
..code::sh
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer ‘y’ to the first two questions.
Restore the configuration file
------------------------------
..code::sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-1.7 /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from previous snapshot, 2.0 uses a different set of system tables. Reference doc: :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>`
..code::sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
Check upgrade instruction above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.