Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closes#19011
PR https://github.com/scylladb/scylladb/pull/17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).
The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.2.
Closes#18666
* github.com:scylladb/scylladb:
sstable_datafile_test: add testcase to test reclaim during reload
sstable_datafile_test: add test to verify auto reload of reclaimed components
sstables_manager: reload previously reclaimed components when memory is available
sstables_manager: start a fiber to reload components
sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
sstable_datafile_test: add test to verify reclaimed components reload
sstables: support reloading reclaimed components
sstables_manager: add new intrusive set to track the reclaimed sstables
sstable: add link and comparator class to support new instrusive set
sstable: renamed intrusive list link type
sstable: track memory reclaimed from components per sstable
sstable: rename local variable in sstable::total_reclaimable_memory_size
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.
Fixes: #18662
(cherry picked from commit f6511ca1b0)
(cherry picked from commit e7b07692b6)
(cherry picked from commit 78afb3644c)
Refs #18667Closes#18723
* github.com:scylladb/scylladb:
test/boost/mutation_fragment_test.cc: add test for validator validation levels
mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z'
it concerns any conversion except JSON timestamp format
JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and timezone specification
Fixes#14518Fixes#7997Closes#14726Fixes#16575
(cherry picked from commit ff721ec3e3)
Closes#18852
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.
(cherry picked from commit e7b07692b6)
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.
Backporting notes:
* Added const const mutation_fragment_stream_validating_filter&
param to on_validation_error()
* Made full_name() public
(cherry picked from commit f6511ca1b0)
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixes scylladb/scylladb#18635
- [x] Although the fixes is for a rare bug, it has very low risk and so it's worth backporting to all live versions
(cherry picked from commit 64c51cf32c)
(cherry picked from commit 88b3173d03)
(cherry picked from commit 4bbb66f805)
Refs #18636Closes#18680
* github.com:scylladb/scylladb:
chunked_vector_test: add more exception safety tests
chunked_vector_test: exception_safe_class: count also moved objects
utils: chunked_vector: fill ctor: make exception safe
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.
However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixesscylladb/scylladb#18635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0b061194a7)
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f758d7b114)
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.
The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 54bb03cff8)
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.
Fixes#18283
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18465
(cherry picked from commit d39adf6438)
Closes#18659
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2340ab63c6)
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3ef2f79d14)
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 02d272fdb3)
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a53af1f878)
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
(cherry picked from commit 8df6d10e88)
Closes#18558
The event is used in a loop.
Found by clang-tidy:
```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:39: note: move occurred here
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
listener->handle_stream_event(std::move(event));
^
```
Fixes#18332
(cherry picked from commit 4fd4e6acf3)
Closes#18430
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.
Fixes#18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18400
(cherry picked from commit 6af2659b57)
Closes#18437
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.
in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.
this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.
Refs 822a315dfaFixes#18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)
Closes#18373
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.
Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.
The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.
This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.
Fixes#18140Closesscylladb/scylladb#18141
(cherry picked from commit 642f9a1966)
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.
The feature can be manually verified by doing the following :
1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB
The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.
Fixes https://github.com/scylladb/scylladb/issues/17747
Backported from #17771 to 5.2.
Closes#18247
* github.com:scylladb/scylladb:
test_bloom_filter.py: disable reclaiming memory from components
sstable_datafile_test: add tests to verify auto reclamation of components
test/lib: allow overriding available memory via test_env_config
sstables_manager: support reclaiming memory from components
sstables_manager: store available memory size
sstables_manager: add variable to track component memory usage
db/config: add a new variable to limit memory used by table components
sstable_datafile_test: add testcase to verify reclamation from sstables
sstables: support reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.
Fixes#17747
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.
Modify what's counted to repair memory limit.
Fixes: https://github.com/scylladb/scylladb/issues/16710.
(cherry picked from commit a4dc6553ab)
(cherry picked from commit 51c09a84cc)
Refs https://github.com/scylladb/scylladb/pull/17785Closes#18237
* github.com:scylladb/scylladb:
test: add test for repair_row::size()
repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.
Include other repair_row's members' memory size in repair memory
limit.
(cherry picked from commit a4dc6553ab)
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
(cherry picked from commit f1cc6252fd)
before this change, `reclaim_timer::report()` calls
```c++
fmt::format(", at {}", current_backtrace())
```
which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.
in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.
Fixes#18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18100
(cherry picked from commit fcf7ca5675)
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
Fixes#17890
(cherry picked from commit 9789a3dc7c)
Closes#18104
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.
Fixes#17854
(cherry picked from commit efcb718)
Closes#17999
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.
Fixes#15902
(cherry picked from commit 8a5689e7a7)
(cherry picked from commit 1d33a68dd7)
Closes#17911
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.
After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes
This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.
The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.
The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.
Fixes#17364
(cherry picked from commit 5ab3586135)
Closes#17858
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)
test/boost/error_injection_test.cc(167): fatal error:
in "test_inject_future_disabled":
critical check wait_time > sleep_msec has failed [23839ns <= 10ms]
Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
Fixes#15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
Major compaction semantics is that all data of a table will be compacted
together, so user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that all data in maintenance set won't be included
for major, until they're promoted into main set by off-strategy.
So user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in maintenance
set will be made available by major. A similar approach is done for
data in memtable, so flush is performed before major starts.
The only exception will be data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization in comapaction manager of reshape jobs guarantee
correctness if there's an ongoing off-strategy on behalf of the
table.
Fixes#11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15792
(cherry picked from commit ea6c281b9f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#17901
This commit updates the Upgrade ScyllaDB Image page.
- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.
(cherry picked from commit fc90112b97)
Closes#17885
The test in its original form relies on the
`error_injections_at_startup` feature, which 5.2 doesn't have, so I
adapted the test to enable error injections after bootstrapping nodes in
the backport (9c44bbce67). That is however
incorrect, it's important for the injection to be enabled while the
nodes are booting, otherwise the test will be flaky, as we observed.
Details in scylladb/scylladb#17749.
Remove the test from 5.2 branch.
Fixesscylladb/scylladb#17749Closes#17750
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.2.
We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.
Closes#17640
* github.com:scylladb/scylladb:
migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
migration_manager: fix indentation after the previous patch.
messaging_service: process migration_request rpc on shard 0
migration_manager: take group0 lock during raft snapshot taking
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.
We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!
We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.
The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
Fixes#17119
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17172
(cherry picked from commit 21e7deafeb)
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```
Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.
Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.
Fixes scylladb/scylladb#16373
(cherry picked from commit a68701ed4f)
Closes#17703
Jumping to shard 0 during group 0 snapshot transfer is required because
we take group 0 lock, onyl available on shard 0. But outside of Raft
mode it only pessimizes performance unnecessarily, so don't do it.
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:
node1 node2
repair1_master (takes permit) repair1_follower (waits on permit)
repair2_master (waits for permit) repair2_follower (takes permit)
In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.
Refs: #17644Fixes: #17591Closes#17646
(cherry picked from commit c6e108a)
Backport notes:
The fix above does not apply to 5.2, because on 5.2 the reader is
created immediately when the repair-meta is created. So we don't need
the game with a fake inactive read, we can just pause the already
created reader in the repair-reader constructor.
Closes#17730
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.
This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.
Fixes#17625
(cherry picked from commit 7a7b8972e5)
Closes#17725
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
(cherry picked from commit 76f0d5e35b)
Closes#17694
Commit 0c376043eb added access to group0
semaphore which can be done on shard0 only. Unlike all other group0 rpcs
(that already always forwarded to shard0) migration_request does not
since it is an rpc that what reused from non raft days. The patch adds
the missing jump to shard0 before executing the rpc.
(cherry picked from commit 4a3c79625f)
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.
Fixesscylladb/scylladb#16821
(cherry picked from commit 0c376043eb)
Backport note: introduced missing
`raft_group0_client::hold_read_apply_mutex`
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.
To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.
Use the method in streaming and repair to continue them when
the table is dropped in the meantime.
Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.
Closes#17528
* github.com:scylladb/scylladb:
repair: handle no_such_column_family from remote node gracefully
test: test drop table on receiver side during streaming
streaming: fix indentation
streaming: handle no_such_column_family from remote node gracefully
repair: add methods to skip dropped table
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.
The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.
The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.
There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:
1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.
Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:
1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.
The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.
Refs: #17493Closesscylladb/scylladb#17499
* github.com:scylladb/scylladb:
version_generator: check that get_next_version is called on shard 0
misc_services: fix data race from bad usage of get_next_version
(cherry picked from commit fd32e2ee10)
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.
(cherry picked from commit cf36015591)
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.
(cherry picked from commit 219e1eda09)
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.
Add method which waits until all raft changes are applied and then
checks whether given table exists.
Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.
(cherry picked from commit 5202bb9d3c)
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
Fixes: https://github.com/scylladb/scylladb/issues/17232.
Closes#17532
* github.com:scylladb/scylladb:
test: add test to check if reader is closed
sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
(cherry picked from commit 5227336a32)
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.
This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.
The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.
Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.
Apart from this change, this PR adds tests for it and updates
the documentation.
This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.
Fixesscylladb/scylladb#7251Fixesscylladb/scylladb#15260Closesscylladb/scylladb#17134
* github.com:scylladb/scylladb:
docs: using-scylla: cdc: remove info about failing writes to old generations
docs: dev: cdc: document writing to previous CDC generations
test: add test_writes_to_previous_cdc_generations
cdc: generation: allow increasing generation_leeway through error injection
cdc: metadata: allow sending writes to the previous generations
(cherry picked from commit 9bb4482ad0)
Backport note: replaced `servers_add` with `server_add` loop in tests
replaced `error_injections_at_startup` (not implemented in 5.2) with
`enable_injection` post-boot
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incrmented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes#17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17164
(cherry picked from commit 14315fcbc3)
For gnutls 3.8.3.
Since Fedora 37 is end-of-life, pick the package from Fedora 38. libunistring
needs to be updated to satisfy the dependency solver.
Fixes#17285.
Closesscylladb/scylladb#17287
Signed-off-by: Avi Kivity <avi@scylladb.com>
Closes#17411
The currently used version of "rustix" depency had a minor
security vulnerability. This patch updates the corresponding
crate.
The update was performed using "cargo update" on "rustix"
package and version "0.36.17"
relevant package and the corresponding version.
Refs #15772Closes#17408
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.
Fixes: #14924Closes#14928
(cherry picked from commit b92d47362f)
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.
So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.
Fixes: #17241Closesscylladb/scylladb#17242
(cherry picked from commit f068d1a6fa)
* seastar 29badd99...ad0f2d5d (1):
> Merge "Slowdown IO scheduler based on dispatched/completed ratio" into branch-5.2
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.
The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.
Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.
Fixes#17035Closesscylladb/scylladb#17073
* github.com:scylladb/scylladb:
utils: replace assert() by on_internal_error()
utils: add on_internal_error with common logger
utils: add a timeuuid minimum, like we had maximum
test/cql-pytest: tests for "date" type
(cherry picked from commit 2a4b991772)
Backports required to fix scylladb/scylladb#16683 in 5.2:
- when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot.
Closes#17087
* github.com:scylladb/scylladb:
test_raft_snapshot_request: fix flakiness (again)
test_raft_snapshot_request: fix flakiness
Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
raft: server: add workaround for scylladb/scylladb#12972
raft: Store snapshot update and truncate log atomically
service: raft: force initial snapshot transfer in new cluster
raft_sys_table_storage: give initial snapshot a non zero value
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().
Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.
The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.
As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.
If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.
This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).
Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.
Fixes#16759Closesscylladb/scylladb#17138
(cherry picked from commit ed98102c45)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.
To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.
Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.
But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.
Closesscylladb/scylladb#17106
(cherry picked from commit c911bf1a33)
Add workaround for scylladb/python-driver#295.
Also an assert made at the end of the test was false, it is fixed with
appropriate comment added.
(cherry picked from commit 74bf60a8ca)
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.
Also add a test.
Fixesscylladb/scylladb#16683Closesscylladb/scylladb#17072
* github.com:scylladb/scylladb:
test: add test for fixing a broken group 0 snapshot
raft_group0: trigger snapshot if existing snapshot index is 0
(cherry picked from commit 181f68f248)
Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
The PR adds the API to `raft::server` and a HTTP endpoint that uses it.
In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)
Closesscylladb/scylladb#16816
* github.com:scylladb/scylladb:
raft: remove `empty()` from `fsm_output`
test: add test for manual triggering of Raft snapshots
api: add HTTP endpoint to trigger Raft snapshots
raft: server: add `trigger_snapshot` API
raft: server: track last persisted snapshot descriptor index
raft: server: framework for handling server requests
raft: server: inline `poll_fsm_output`
raft: server: fix indentation
raft: server: move `io_fiber`'s processing of `batch` to a separate function
raft: move `poll_output()` from `fsm` to `server`
raft: move `_sm_events` from `fsm` to `server`
raft: fsm: remove constructor used only in tests
raft: fsm: move trace message from `poll_output` to `has_output`
raft: fsm: extract `has_output()`
raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
raft: server: pass `*_aborted` to `set_exception` call
(cherry picked from commit d202d32f81)
Backport notes:
- `has_output()` has a smaller condition in the backported version
(because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.
In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec77c6 but
the fix was not backported due to many prerequisites and complexity.
Raft automatically recovers from interrupted snapshot transfer by
retrying it eventually, and everything works. However an ERROR is
reported due to that one failed snapshot transfer, and dtests dont like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs even if the test passed otherwise.
Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066Closes#14336
(cherry picked from commit ff386e7a44)
Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because of tombstones in current schema state expire hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over
latest schema state.
Ref: scylladb/scylladb#16683
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though since the state of group0 will not be re-created from the
log on restart because the entries will have outdated id and will be
skipped. As a result in memory state machine state will not be restored.
This is not a problem now since schema state it restored outside of raft
code.
Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070722)
Before returning task status, wait_task waits for it to finish with
done() method and calls get() on a resulting future.
If requested task fails, an exception will be thrown and user will
get internal server error instead of failed task status.
Result of done() method is ignored.
Fixes: #14914.
(cherry picked from commit ae67f5d47e)
Closes#16438
discard_result ignores only successful futures. Thus, if
perform_compaction<regular_compaction_task_executor> call fails,
a failure is considered abandoned, causing tests to fail.
Explicitly ignore failed future.
Fixes: #14971.
Closes#15000
(cherry picked from commit 7a28cc60ec)
Closes#16441
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).
This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.
This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.
Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.
In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.
Fixes: scylladb/scylladb#15622Closesscylladb/scylladb#15972
(cherry picked from commit f094e23d84)
When a base table changes and altered, so does the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns).
However the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.
Fixes https://github.com/scylladb/scylladb/issues/16392
This series also adds a test which fails without this fix and passes when the fix is applied.
Closesscylladb/scylladb#16897
* github.com:scylladb/scylladb:
Add test for mv prepared statements invalidation on base alter
query processor: treat view changes at least as table changes
(cherry picked from commit 5810396ba1)
On some environment such as VMware instance, /dev/disk/by-uuid/<UUID> is
not available, scylla_raid_setup will fail while mounting volume.
To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fallback to other paths like by-partuuid,
by-id, by-path or just using real device path like /dev/md0.
To get device path, and also to dumping device status when UUID is not
available, this will introduce UdevInfo class which communicate udev
using pyudev.
Related #11359Closesscylladb/scylladb#13803
(cherry picked from commit 58d94a54a3)
[syuu: renegerate tools/toolchain/image for new python3-pyudev package]
Closes#16938
When the reader is currently paused, it is resumed, fast-forwarded, then
paused again. The fast forwarding part can throw and this will lead to
destroying the reader without it being closed first.
Add a try-catch surrounding this part in the code. Also mark
`maybe_pause()` and `do_pause()` as noexcept, to make it clear why
that part doesn't need to be in the try-catch.
Fixes: #16606Closesscylladb/scylladb#16630
(cherry picked from commit 204d3284fa)
If std::vector is resized its iterators and references may
get invalidated. While task_manager::task::impl::_children's
iterators are avoided throughout the code, references to its
elements are being used.
Since children vector does not need random access to its
elements, change its type to std::list<foreign_task_ptr>, which
iterators and references aren't invalidated on element insertion.
Fixes: #16380.
Closesscylladb/scylladb#16381
(cherry picked from commit 9b9ea1193c)
Closes#16777
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: https://github.com/scylladb/scylladb/issues/14710.
Closes#16750
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Delete overriden create_table_statement::execute as it only calls its
direct parent's (schema_altering_statement) execute method anyway.
(cherry picked from commit 6c7eb7096e)
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
(cherry picked from commit 60fdc44bce)
Pass data_dictionary::database to check_restricted_table_properties
as an arguemnt instead of query_processor as the method will be called
from a context which does not have access to query processor.
(cherry picked from commit ec98b182c8)
TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows.
Turns out we had two problems in this area that leads to suboptimal bloom filters.
1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed.
2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count.
For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong.
Fixes https://github.com/scylladb/scylladb/issues/15704.
Closes scylladb/scylladb#15938
* github.com:scylladb/scylladb:
streaming: Improve partition estimation with TWCS
streaming: Don't adjust partition estimate if segregation is postponed
(cherry picked from commit 64d1d5cf62)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#16672
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
(cherry picked from commit 6ffb482bf3)
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
a mutation from the reader, uses consume(), which does not trigger
`set_close_is_required()` (#16520). Due to this, the top-level
combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
for the current range. If the circumstances are just right, all
underlying readers are closed, before the combined reader is
destoyed. Looks like this is what happens for the most time.
This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.
Fixes: #16519Closesscylladb/scylladb#16521
(cherry picked from commit da033343b7)
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.
Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.
Fixes: #15078
Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
(cherry picked from commit d1654ccdda)
Fixes#9405
`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.
To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.
Closes#14534
* github.com:scylladb/scylladb:
db: hints: add checksum to sync point encoding
db: hints: add the version_size constant
(cherry picked from commit eb6202ef9c)
The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.
Closes#16458
* tools/java e2aad6e3a0...d7ec9bf45f (1):
> Merge "build: take care of old libthrift" from Piotr Grabowski
Fixes: scylladb/scylla-tools-java#352
Closes#16464
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.
Closesscylladb/scylladb#15428
(cherry picked from commit 62a8a31be7)
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.
This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.
Fix the issue by using a exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with reconnection exception stack trace.
Fixes#15104Closes#15112
(cherry picked from commit 17e3e367ca)
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
(cherry picked from commit 55f047f33f)
Having values of the duration type is not allowed for clustering
columns, because duration can't be ordered. This is correctly validated
when creating a table but do not validated when we alter the type.
Fixes#12913Closesscylladb/scylladb#16022
(cherry picked from commit bd73536b33)
systemd man page says:
systemd-fstab-generator(3) automatically adds dependencies of type Before= to
all mount units that refer to local mount points for this target unit.
So "Before=local-fs.taget" is the correct dependency for local mount
points, but we currently specify "After=local-fs.target", it should be
fixed.
Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target",
since .mount are not related with multi-user but depends local
filesystems.
Fixes#8761Closesscylladb/scylladb#15647
(cherry picked from commit a23278308f)
Shard id is logged twice in repair (once explicitly, once added by logger).
Redundant occurrence is deleted.
shard_repair_task_impl::id (which contains global repair shard)
is renamed to avoid further confusion.
Fixes: https://github.com/scylladb/scylladb/issues/12955Closes#16439
* github.com:scylladb/scylladb:
repair: rename shard_repair_task_impl::id
repair: delete redundant shard id from logs
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.
The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.
Fixes#6389.
Closes#13150
* github.com:scylladb/scylladb:
test/alternator: test concurrent TagResource / UntagResource
db/tags: drop unsafe update_tags() utility function
alternator: isolate concurrent modification to tags
db/tags: add safe modify_tags() utility functions
migration_manager: expose access to storage_proxy
(cherry picked from commit dba1d36aa6)
Closes#16453
Backport the following patches to 5.2:
- gossiper: mark_alive: enter background_msg gate (#14791)
- gossiper: mark_alive: use deferred_action to unmark pending (#14839)
Closes#16452
* github.com:scylladb/scylladb:
gossiper: mark_alive: use deferred_action to unmark pending
gossiper: mark_alive: enter background_msg gate
The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column
of type "time" forgot to put the time string in quotes. The result was
invalid JSON. This is patch is a one-liner fixing this bug.
This patch also removes the "xfail" marker from one xfailing test
for this issue which now starts to pass. We also add a second test for
this issue - the existing test was for "SELECT TOJSON(t)", and the second
test shows that "SELECT JSON t" had exactly the same bug - and both are
fixed by the same patch.
We also had a test translated from Cassandra which exposed this bug,
but that test continues to fail because of other bugs, so we just
need to update the xfail string.
The patch also fixes one C++ test, test/boost/json_cql_query_test.cc,
which enshrined the *wrong* behavior - JSON output that isn't even
valid JSON - and had to be fixed. Unlike the Python tests, the C++ test
can't be run against Cassandra, and doesn't even run a JSON parser
on the output, which explains how it came to enshrine wrong output
instead of helping to discover the bug.
Fixes#7988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16121
(cherry picked from commit 8d040325ab)
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.
Fixes#14839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14840
(cherry picked from commit 1e7e2eeaee)
we compare the symbols lists of stripped ELF file ($orig.stripped) and
that of the one including debugging symbols ($orig.debug) to get a
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).
but we generate the symbol list of stripped ELF file using the
sysv format, while generate the one from the unstripped one using
posix format. the former is always padded the symbol names with spaces
so that their the length at least the same as the section name after
we split the fields with "|".
that's why the diff includes the stuff we don't expect. and hence,
we have tons of warnings like:
```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```
when using objcopy to filter the ELF file to keep only the
symbols we are interested in.
so, in this change
* use the same format when dumping the symbols from unstripped ELF
file
* include the symbols in the text area -- the code, by checking
"T" and "t" in the dumped symbols. this was achieved by matching
the lines with "FUNC" before this change.
* include the the symbols in .init data section -- the global
variables which are initialized at compile time. they could
be also interesting when debugging an application.
Fixes#15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15514
(cherry picked from commit 50c937439b)
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes#15632Closesscylladb/scylladb#15633
(cherry picked from commit 330d221deb)
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closesscylladb/scylladb#15499
(cherry picked from commit 75109e9519)
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is throw in place
of an `assert()`, which is supposed to verify the programming
invariants, not for identifying corrupted data file.
Fixes#15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15264
(cherry picked from commit 1ed894170c)
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:
```
WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```
It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.
And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.
Fixes#13538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14557
(cherry picked from commit edfb89ef65)
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.
so, let's make the percentage a float before formatting it.
Fixes#14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14588
(cherry picked from commit 1eb76d93b7)
shard_repair_task_impl::id stores global repair id. To avoid confusion
with the task id, the field is renamed to global_repair_id.
(cherry picked from commit d889a599e8)
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.
Fixes#15110
(cherry picked from commit 9fd3df13a2)
before this change, we format a sstring with "{:d}", fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.
so, in this change, we just print the int8_t using `seastar::format()`
in a single pass. and with the format specifier of `#02x` instead of
adding the "0x" prefix manually.
Fixes#14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14578
(cherry picked from commit 27d6ff36df)
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:
```
"regular_columns": [
"name",
"clustering_order",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"column_name_bytes",
"type_name",
"org.apache.cassandra.db.marshal.BytesType",
"name",
"kind",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"position",
"type_name",
"org.apache.cassandra.db.marshal.Int32Type",
"name",
"type",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type"
]
```
but according
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,
> $SERIALIZATION_HEADER_METADATA := {
> "min_timestamp_base": Uint64,
> "min_local_deletion_time_base": Uint64,
> "min_ttl_base": Uint64",
> "pk_type_name": String,
> "clustering_key_types_names": [String, ...],
> "static_columns": [$COLUMN_DESC, ...],
> "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
> "name": String,
> "type_name": String
> }
"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columnes". this schema makes sense,
as each column should be considered as a single object which
is composed of two properties. but we dump them like a list.
so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.
after the change, "regular_columns" are printed like:
```
"regular_columns": [
{
"name": "clustering_order",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "column_name_bytes",
"type_name": "org.apache.cassandra.db.marshal.BytesType"
},
{
"name": "kind",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "position",
"type_name": "org.apache.cassandra.db.marshal.Int32Type"
},
{
"name": "type",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
}
]
```
Fixes#15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15037
(cherry picked from commit c82f1d2f57)
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.
The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.
This fix should resolve the issue where forward_service retries could
block the shutdown.
Fixes#12604Closes#13922
(cherry picked from commit e0855b1de2)
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrary large.
Fixes: #12504
(cherry picked from commit 5adbb6cde2)
This is used for readiness API: /storage_service/rpc_server and the fix prevents from returning 'true' prematurely.
Some improvement for readiness was added in a51529dd15 but thrift implementation wasn't fully done.
Fixes https://github.com/scylladb/scylladb/issues/12376Closes#13319
* github.com:scylladb/scylladb:
thrift: return address in listen_addresses() only after server is ready
thrift: simplify do_start_server() with seastar:async
(cherry picked from commit 9a024f72c4)
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes#13657Closes#13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
(cherry picked from commit a29b8cd02b)
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.
\Fixes #13899
\Closes #13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
(cherry picked from commit d2d53fc1db)
Closes#16431
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
(cherry picked from commit 135b4fd434)
Currently, the reader might dereference a null pointer
if the input stream reaches eof prematurely,
and read_exactly returns an empty temporary_buffer.
Detect this condition before dereferencing the buffer
and sstables::malformed_sstable_exception.
Fixes#13599
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13600
(cherry picked from commit 77b70dbdb7)
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes#13807Closes#13804
(cherry picked from commit 6bc16047ba)
The metric shows the opposite of what its name suggests.
It shows available memory rather than consumed memory.
Fix that.
Fixes#13810Closes#13811
(cherry picked from commit 0813fa1da0)
The use statement execution code can throw if the keyspace is
doesn't exist, this can be a problem for code that will use
execute in a fiber since the exception will break the fiber even
if `then_wrapped` is used.
Fixes#14449
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closesscylladb/scylladb#14394
(cherry picked from commit c5956957f3)
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.
This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.
In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.
As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.
This series fixes the bug and adds a minimal reproducer test.
Fixes#10598Closesscylladb/scylladb#16003
* github.com:scylladb/scylladb:
mutation_query_test: test that range tombstones are sent in reverse queries
mutation_query: properly send range tombstones in reverse queries
(cherry picked from commit 65e42e4166)
When scanning our latest docker image using `trivy` (command: `trivy
image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS
packages which are out of date.
Also removing `openssh-server` and `openssh-client` since we don't use
it for our docker images
Fixes: https://github.com/scylladb/scylladb/issues/16222
Closes scylladb/scylladb#16224
(cherry picked from commit 7ce6962141)
Closes#16360
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
Fixes: #13540Closes#13541
(cherry picked from commit b790f14456)
Currently the cache updaters aren't exception safe
yet they are intended to be.
Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.
Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.
\Fixes scylladb/scylladb#15576
\Closes scylladb/scylladb#15577
* github.com:scylladb/scylladb:
row_cache: abort on exteral_updater::execute errors
row_cache: do_update: simplify _prev_snapshot_pos setup
(cherry picked from commit 4a0f16474f)
Closesscylladb/scylladb#16256
Fixes#16153
* java e716e1bd1d...80701efa8d (1):
> NodeProbe: allow addressing table name with colon in it
/home/nyh/scylla/tools$ git submodule summary jmx | cat
* jmx bc4f8ea...f21550e (3):
> ColumnFamilyStore: only quote table names if necessary
> APIBuilder: allow quoted scope names
> ColumnFamilyStore: don't fail if there is a table with ":" in its name
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#16296
This commit fixes the rollback procedure in
the 4.6-to-5.0 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16155
(cherry picked from commit 1e80bdb440)
This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16154
(cherry picked from commit 7ad0b92559)
This commit fixes the rollback procedure in
the 5.1-to-5.2 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4 and branch-5.2.
Closesscylladb/scylladb#16152
(cherry picked from commit 91cddb606f)
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth is timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13851
(cherry picked from commit a7ceb987f5)
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.
Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.
Fixesscylladb/scylladb#15211
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f9a7635390)
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.
The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.
The easy fix is to perform the only potentially-throwing step first.
Fixes#15822Closesscylladb/scylladb#15864
(cherry picked from commit 93ea3d41d8)
Update node_exporter to 1.7.0.
The previous version (1.6.1) was flagged by security scanners (such as
Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that
problem.
[Botond: regenerate frozen toolchain]
Fixes#16085Closesscylladb/scylladb#16086Closesscylladb/scylladb#16090
(cherry picked from commit 321459ec51)
[avi: regenerate frozen toolchain]
* tools/jmx 88d9bdc...bc4f8ea (1):
> Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski
* tools/java f8f556d802...e716e1bd1d (1):
> Merge 'build: update several dependencies' from Piotr Grabowski
Update build dependencies which were flagged by security scanners.
Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351
Closes#16150
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schemaFixes#15380Closes#15381
(cherry picked from commit c27d212f4b)
Currently, when said feature is enabled, we recalcuate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalulation. However this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidently, by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.
Fixes: #16004Closesscylladb/scylladb#16013
(cherry picked from commit 22381441b0)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
(cherry picked from commit 48164e1d09)
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
(cherry picked from commit cf81eef370)
Currently the code will assert because cl pointer will be null and it
will be null because there is no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
(cherry picked from commit 941407b905)
Backport needed by #4485.
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.
To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.
Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.
I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.
Fixes: #15485Closesscylladb/scylladb#16019
(cherry picked from commit dfd7981fa7)
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".
Fixes#16040Closesscylladb/scylladb#16041
(cherry picked from commit 338a9492c9)
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.
This fix affects also alternator's secondary indexes.
Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844Closesscylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.
Fixes: #14330Closesscylladb/scylladb#15879
(cherry picked from commit 90a8489809)
While looking for specific UDF/UDA, result of
`functions::functions::find()` needs to be filtered out based on
function's type.
Fixes: #14360
(cherry picked from commit d498451cdf)
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be easily and
without risk be backported to older versions; this is the fix.
Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.
Fixes: scylladb/scylladb#15816Closesscylladb/scylladb#15970
* github.com:scylladb/scylladb:
test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
api: failure_detector: fix indentation
api: failure_detector: invoke on shard 0
(cherry picked from commit 9443253f3d)
This `with` context is supposed to disable, then re-enable
autocompaction for the given keyspaces, but it used the wrong API for
it, it used the column_family/autocompaction API, which operates on
column families, not keyspaces. This oversight led to a silent failure
because the code didn't check the result of the request.
Both are fixed in this patch:
* switch to use `storage_service/auto_compaction/{keyspace}` endpoint
* check the result of the API calls and report errors as exceptions
Fixes: #13553Closes#13568
(cherry picked from commit 66ee73641e)
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.
In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.
Release shard_repair_task_impl resources immediately after shard
repair is finished.
Fixes: #15505.
(cherry picked from commit 0474e150a9)
Closes#15875
scylla-sstable currently has two ways to obtain the schema:
* via a `schema.cql` file.
* load schema definition from memory (only works for system tables).
This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.
This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.
Example:
```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```
As seen above, subdirectories like qurantine, staging etc are also supported.
Fixes: https://github.com/scylladb/scylladb/issues/10126Closes#13448
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add tests for schema loading
test/cql-pytest: add no_autocompaction_context
docs: scylla-sstable.rst: remove accidentally added copy-pasta
docs: scylla-sstable.rst: remove paragraph with schema limitations
docs: scylla-sstable.rst: update schema section
test/cql-pytest: nodetool.py: add flush_keyspace()
tools/scylla-sstable: reform schema loading mechanism
tools/schema_loader: add load_schema_from_schema_tables()
db/schema_tables: expose types schema
(cherry picked from commit 952b455310)
Closes#15386
This is a backport of https://github.com/scylladb/scylladb/pull/14158 to branch 5.2
Closes#15872
* github.com:scylladb/scylladb:
migration_notifier: get schema_ptr by value
migration_manager: propagate listener notification exceptions
storage_service: keyspace_changed: execute only on shard 0
database: modify_keyspace_on_all_shards: execute func first on shard 0
database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
database: add modify_keyspace_on_all_shards
schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
database: create_keyspace_on_all_shards
database: update_keyspace_on_all_shards
database: drop_keyspace_on_all_shards
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**
Closes#15793
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
replica: Make compaction_group responsible for deleting off-strategy compaction input
To prevent use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097
where a temp schema_ptr retrieved from a global_schema_ptr
get destroyed when the notification function yielded.
Capturing the schema_ptr on the coroutine frame
is inexpensive since its a shared ptr and it makes sure
that the schema remains valid throughput the coroutine
life time.
\Fixes scylladb/scylladb#15097
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
\Closes #15098
(cherry picked from commit 0f54e24519)
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards. Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.
We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```
The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
Refs #3389
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 825d617a53)
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.
Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a690f0e81f)
When creating or altering a keyspace, we create a new
effective_replication_map instance.
It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate to new e_r_m (and reach
the same result). When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 13dd92e618)
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.
\Fixes #13137
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ba15786059)
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3b8c913e61)
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystem
implements the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data lost if the application / machine /
network breaks when the data is considerered persisted but they are
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
Fixes#15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15134
(cherry picked from commit d2d1141188)
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes#14705.
Closes#14707
(cherry picked from commit e00811caac)
Prevent div-by-zero byt returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It's not a problem until now
where we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes#14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting regular_compaction, which implement
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.
Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
sstables::offstrategy offstrategy)
So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.
Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13432
(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Cherry picked from commit 38b226f997
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes#14091.
Closes#13908
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15744
This is a backport of PR https://github.com/scylladb/scylladb/pull/15740.
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to.-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides.
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
Closes#15768
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides.
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
(cherry picked from commit dd1207cabb)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 526d543b95)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 9852130c5b)
The estimated_partitions is estimated after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
write which is not correct.
To fix, use the updated estimated_partitions.
Reported by Petr Gusev
Closes#14179Fixes#15748
(cherry picked from commit 4592bbe182)
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.1-to.-5.2
upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
page.
- Remove outdated info (about previous releases)
from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
in the page tree.
Refs: https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit f6767f6d6e)
Closes#15754
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, inactive read registration path.
1) service level drop invoke stop of reader concurrency semaphore, which will
wait for in flight requests
2) turns out it stops first the gate used for closing readers that will
become inactive.
3) proceeds to wait for in-flight reads by closing the reader permit gate.
4) one of evictable reads take the inactive read registration path, and
finds the gate for closing readers closed.
5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.
Fixes#15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15535
(cherry picked from commit 914cbc11cf)
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes#12261Fixes#15088
Also, add rest_api unit test for those apis
Closes#15093
* github.com:scylladb/scylladb:
test: rest_api: add test_gossiper
gossiper: add get_unreachable_members_synchronized
(cherry picked from commit 57deeb5d39)
Backport note: `gossiper::lock_endpoint_update_semaphore` helper
function was missing, replaced with
`get_units(g._endpoint_update_semaphore, 1)`
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.
A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.
Fixes#14257
Refs #14357Closes#14329
(cherry picked from commit b9c2b326bc)
Backport note: replaced `gms::generation_type` with `int64_t` because
the branch is missing the refactor which introduced `generation_type`
(7f04d8231d)
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:
The query has a IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
CVE-2023-33972
Fixes#15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
(cherry picked from commit ab6988c52f)
Fixes https://github.com/scylladb/scylladb/issues/14490
This commit fixes mulitple links that were broken
after the documentation is published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.
Closes#14664
(cherry picked from commit a93fd2b162)
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.
This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.
Closes#15415
(cherry picked from commit fb635dccaa)
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.
Fixes#14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14615
(cherry picked from commit 3b1829f0d8)
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
\Closes #14785
(cherry picked from commit cbc97b41d4)
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes#14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14496
(cherry picked from commit 599636b307)
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes#15198Closes#15199
(cherry picked from commit 2000a09859)
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.
Move lock to the callers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13240
(cherry picked from commit da00052ad8)
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.
Closes#12540
* github.com:scylladb/scylladb:
api: gossiper: fix alive nodes
gms, service: lock live endpoint copy
gms, service: live endpoint copy method
(cherry picked from commit b919373cce)
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
Fixes#15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)
Closes#15150
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.
The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.
The patch just moves the index creation statements
from the second part to the first.
Fixes: #14076Closes#14090
(cherry picked from commit 0415ac3d5f)
Closes#15101
This mini-series backports the fix for #12010 along with low-risk patches it depends on.
Fixes: #12010Closes#15137
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
utils: Restore indentation after previous patch
utils: Coroutinize verify_owner_and_mode()
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes#12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow an exception in perfrom task for reshape compaction.
Fixes: #15058.
(cherry picked from commit e0ce711e4f)
Closes#15122
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes#14963Closes#14964
(cherry picked from commit e13a2b687d)
2023-08-22 10:40:37 +03:00
281 changed files with 6147 additions and 2563 deletions
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",receiver.name,field,field_spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",*receiver.name,field,field_spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",receiver.name,i,spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",*receiver.name,i,spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Cannot add new field to type {} because it is used in the clustering key column {} of table {}.{} where durations are not allowed",
"Adjusts the sensitivity of the failure detector on an exponential scale. Generally this setting never needs adjusting.\n"
"Related information: Failure detection and recovery")
,failure_detector_timeout_in_ms(this,"failure_detector_timeout_in_ms",liveness::LiveUpdate,value_status::Used,20*1000,"Maximum time between two successful echo message before gossip mark a node down in milliseconds.\n")
,direct_failure_detector_ping_timeout_in_ms(this,"direct_failure_detector_ping_timeout_in_ms",value_status::Used,600,"Duration after which the direct failure detector aborts a ping message, so the next ping can start.\n"
"Note: this failure detector is used by Raft, and is different from gossiper's failure detector (configured by `failure_detector_timeout_in_ms`).\n")
/* Performance tuning properties */
/* Tuning performance and system reso urce utilization, including commit log, compaction, memory, disk I/O, CPU, reads, and writes. */
,unspooled_dirty_soft_limit(this,"unspooled_dirty_soft_limit",value_status::Used,0.6,"Soft limit of unspooled dirty memory expressed as a portion of the hard limit")
,sstable_summary_ratio(this,"sstable_summary_ratio",value_status::Used,0.0005,"Enforces that 1 byte of summary is written for every N (2000 by default) "
"bytes written to data file. Value must be between 0 and 1.")
,components_memory_reclaim_threshold(this,"components_memory_reclaim_threshold",liveness::LiveUpdate,value_status::Used,.2,"Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
,large_memory_allocation_warning_threshold(this,"large_memory_allocation_warning_threshold",value_status::Used,size_t(1)<<20,"Warn about memory allocations above this size; set to zero to disable")
,enable_deprecated_partitioners(this,"enable_deprecated_partitioners",value_status::Used,false,"Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
,enable_keyspace_column_family_metrics(this,"enable_keyspace_column_family_metrics",value_status::Used,false,"Enable per keyspace and per column family metrics reporting")
,enable_sstables_mc_format(this,"enable_sstables_mc_format",value_status::Unused,true,"Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
,enable_sstables_md_format(this,"enable_sstables_md_format",value_status::Unused,true,"Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
,enable_dangerous_direct_import_of_cassandra_counters(this,"enable_dangerous_direct_import_of_cassandra_counters",value_status::Used,false,"Only turn this option on if you want to import tables from Cassandra containing counters, and you are SURE that no counters in that table were created in a version earlier than Cassandra 2.1."
" It is not enough to have ever since upgraded to newer versions of Cassandra. If you EVER used a version earlier than 2.1 in the cluster where these SSTables come from, DO NOT TURN ON THIS OPTION! You will corrupt your data. You have been warned.")
,enable_shard_aware_drivers(this,"enable_shard_aware_drivers",value_status::Used,true,"Enable native transport drivers to use connection-per-shard for better performance")
Outputs all tokens of a node, and displays the :doc:`token ring information </operating-scylla/nodetool-commands/ring>`. It produces output as follows for a single datacenter:
@@ -158,9 +158,27 @@ We're not able to prevent a node learning about a new generation too late due to
However, it could happen that a node learns about the generation from gossip in time, but then won't be able to extract it from `cdc_generation_descriptions_v2`. In that case we can still maintain consistency: the node will remember that there is a new generation even though it doesn't yet know what it is (it knows only the ID, in particular it knows the timestamp) using the `cdc::metadata::prepare(db_clock::time_point)` method, and then _reject_ writes for CDC-enabled tables that are supposed to use this new generation. The node will keep trying to read the generation's data in background until it succeeds or sees that it's not necessary anymore (e.g. because the generation was already superseded by a new generation).
Thus we give up availability for safety. This likely won't happen if the administrator ensures that the cluster is not partitioned before bootstrapping a new node. This problem will also be mitigated with a future patch.
Due to the need of maintaining colocation we don't allow the client to send writes with arbitrary timestamps.
Suppose that a write is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`). Then we only allow the write if its timestamp is in the interval [`T`, `C + generation_leeway`), where `generation_leeway` is a small time-inteval constant (e.g. 5 seconds).
Reason: we cannot allow writes before `T`, because they belong to the old generation whose token ranges might no longer refine the current vnodes, so the corresponding log write would not necessarily be colocated with the base write. We also cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
#### Generation switching: accepting writes
Due to the need of maintaining colocation we don't allow the client to send writes with arbitrary timestamps. We allow:
- writes to the current and next generations unless they are too far into the future,
- writes to the previous generations unless they are too far into the past.
##### Writes to the current and next generations
Suppose that a write with timestamp `W` is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`) such that `T <= W`. Then we only allow the write if `W < C + generation_leeway`, where `generation_leeway` is a small time-interval constant (e.g. 5 seconds).
We cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
##### Writes to the previous generations
This time suppose that `T > W`. Then we only allow the write if `W > C - generation_leeway` and there was a generation operating at `W`.
We allow writes to previous generations to improve user experience. If a client generates timestamps by itself and clocks are not perfectly synchronized, there may be short periods of time around the moment of switching generations when client's writes are rejected because they fall into one of the previous generations. Usually, this problem is easy to overcome by the client. It can simply repeat a write a few times, but using a higher timestamp. Unfortunately, if a table additionally uses LWT, the client cannot increase the timestamp because LWT makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves doing a CDC log table write. If it fails, we are stuck. Allowing writes to the previous generations is also a probabilistic fix for this bug.
Note that writing only to the previous generation might not be enough. With the Raft-based topology and tablets, we can add multiple nodes almost instantly. Then, we can have multiple generations with almost identical timestamps.
We allow writes only to the recent past to reduce the number of generations that must be stored in memory.
@@ -17,7 +17,7 @@ This tool is similar to SStableDump_, with notable differences:
* Expanded scope: this tool supports much more than dumping SStable data components (see `Supported Operations`_).
* More flexible on how schema is obtained and where SStables are located: SStableDump_ only supports dumping SStables located in their native data directory. To dump an SStable, one has to clone the entire ScyllaDB data directory tree, including system table directories and even config files. ``scylla sstable`` can dump sstables from any path with multiple choices on how to obtain the schema, see Schema_.
Currently, SStableDump_ works better on production systems as it automatically loads the schema from the system tables, unlike ``scylla sstable``, which has to be provided with the schema explicitly. On the other hand ``scylla sstable`` works better for off-line investigations, as it can be used with as little as just a schema definition file and a single sstable. In the future we plan on closing this gap -- adding support for automatic schema-loading for ``scylla sstable`` too -- and completely supplant SStableDump_ with ``scylla sstable``.
``scylla sstable`` was developed to supplant SStableDump_ as ScyllaDB-native tool, better tailored for the needs of ScyllaDB.
@@ -35,14 +35,33 @@ You can specify more than one SStable.
Schema
------
All operations need a schema to interpret the SStables with.
Currently, there are two ways to obtain the schema:
This tool tries to auto-detect the location of the ScyllaDB data directories and the name of the table the SStable belongs to.
If the SStable is located in a ScyllaDB data directory, it works out-of-the-box, without any additional input from the user.
If the SStable is located at an external path, you need to specify the names of the keyspace and table to which the SStable belongs. In addition, some hints as to where the ScyllaDB data directory is located may also be required.
The schema can be obtained in the following ways:
* Auto-detected - If the SStable is located in the table's directory within the ScyllaDB data directory.
*``--keyspace=KEYSPACE --table=TABLE`` - If the SStable is located at an external location, but the ScyllaDB data directory or the config file are located at the standard location. The tool also reads the ``SCYLLA_CONF`` and ``SCYLLA_HOME`` environment variables to try to locate the configuration file.
*``--schema-file FILENAME`` - Read the schema definition from a file.
*``--system-schema KEYSPACE.TABLE`` - Use the known definition of built-in tables (only works for system tables).
*``--system-schema --keyspace=KEYSPACE --table=TABLE`` - Use the known definition of built-in tables (only works for system tables).
*``--scylla-data-dir SCYLLA_DATA_DIR_PATH --keyspace=KEYSPACE --table=TABLE`` - Read the schema tables from the data directory at the provided location, needs the keyspace and table name to be provided with ``--keyspace`` and ``--table``.
*``--scylla-yaml-file SCYLLA_YAML_FILE_PATH --keyspace=KEYSPACE --table=TABLE`` - Read the schema tables from the data directory path obtained from the configuration, needs the keyspace and table name to be provided with ``--keyspace`` and ``--table``.
By default, the tool uses the first method: ``--schema-file schema.cql``; i.e. it assumes there is a schema file named ``schema.cql`` in the working directory.
If this fails, it will exit with an error.
By default (no schema-related options are provided), the tool will try the following sequence:
* Try to load schema from ``schema.cql``.
* Try to deduce the ScyllaDB data directory path and table names from the SStable path.
* Try to load the schema from the ScyllaDB directory located at the standard location (``/var/lib/scylla``). For this to succeed, the table name has to be provided via ``--keyspace`` and ``--table``.
* Try to load the schema from the ScyllaDB directory path obtained from config at the standard location (``./conf/scylla.yaml``). ``SCYLLA_CONF`` and ``SCYLLA_HOME`` environment variables are also checked. For this to succeed, the table name has to be provided via ``--keyspace`` and ``--table``.
The tool stops after the first successful attempt. If none of the above succeed, an error message will be printed.
A user provided schema in ``schema.cql`` (if present) always takes precedence over other methods. This is deliberate, to allow to manually override the schema to be used.
schema.cql
^^^^^^^^^^
The schema file should contain all definitions needed to interpret data belonging to the table.
@@ -72,7 +91,7 @@ Note:
* The schema file doesn't have to be called ``schema.cql``, this is just the default name. Any file name is supported (with any extension).
Dropped columns
^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~
The examined sstable might have columns which were dropped from the schema definition. In this case providing the up-do-date schema will not be enough, the tool will fail when attempting to process a cell for the dropped column.
Dropped columns can be provided to the tool in the form of insert statements into the ``system_schema.dropped_columns`` system table, in the schema definition file. Example:
@@ -282,34 +301,6 @@ The content is dumped in JSON, using the following schema:
},
"pos": Uint64
}
)",
dump_index_operation},
/* dump-compression-info */
{"dump-compression-info",
"Dump content of sstable compression info(s)",
R"(
Dumps the content of the compression-info component. Contains compression
parameters and maps positions into the uncompressed data to that into compressed
data. Note that compression happens over chunks with configurable size, so to
get data at a position in the middle of a compressed chunk, the entire chunk has
to be decompressed.
For more information about the sstable components and the format itself, visit
https://docs.scylladb.com/architecture/sstable/.
The content is dumped in JSON, using the following schema:
This tool allows you to converts SSTable into a JSON format file.
SSTabledump supported when using Scylla 3.0, Scylla Enterprise 2019.1, and newer versions.
In older versions, the tool is named SSTable2json_.
If you need more flexibility or want to dump more than just the data-component, see scylla-sstable_.
In older versions, the tool is named :doc:`SSTable2json </operating-scylla/admin-tools/sstable2json>`.
If you need more flexibility or want to dump more than just the data-component, see :doc:`scylla-sstable </operating-scylla/admin-tools/scylla-sstable>`.
@@ -17,12 +17,8 @@ The backup includes two procedures. These are:
Full Backup - Snapshots
=======================
Snapshots are taken using `nodetool snapshot`_. First, the command flushes the MemTables from memory to SSTables on disk, and afterward, it creates a hard link for each SSTable in each keyspace.
With time, SSTables are compacted, but the hard link keeps a copy of each file. This takes up an increasing amount of disk space. It is important to clear space by `clean unnecessary snapshots`_.
Snapshots are taken using :doc:`nodetool snapshot </operating-scylla/nodetool-commands/snapshot>`. First, the command flushes the MemTables from memory to SSTables on disk, and afterward, it creates a hard link for each SSTable in each keyspace.
With time, SSTables are compacted, but the hard link keeps a copy of each file. This takes up an increasing amount of disk space. It is important to clear space by :doc:`clean unnecessary snapshots </operating-scylla/procedures/backup-restore/delete-snapshot>`.
See the full code example `here <https://github.com/scylladb/scylla-code-samples/tree/master/dual_writes>`_
3. On each Apache Cassandra node, take a snapshot for every keyspace using the `nodetool snapshot`_ command. This will flush all SSTables to disk and generate a ``snapshots`` folder with an epoch timestamp for each underlying table in that keyspace.
3. On each Apache Cassandra node, take a snapshot for every keyspace using the :doc:`nodetool snapshot </operating-scylla/nodetool-commands/snapshot>` comand. This will flush all SSTables to disk and generate a ``snapshots`` folder with an epoch timestamp for each underlying table in that keyspace.
Folder path post snapshot: ``/var/lib/cassandra/data/keyspace/table-[uuid]/snapshots/[epoch_timestamp]/``
@@ -31,7 +31,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/index.html>`_ for suspending ScyllaDB Manager (only available for ScyllaDB Enterprise) scheduled or running repairs.
* Not to apply schema changes
..note:: Before upgrading, make sure to use the latest `ScyllaDB Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
..note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============
@@ -58,9 +58,14 @@ When the upgrade is completed on all nodes, remove the snapshot with the ``nodet
Backup the configuration file
------------------------------
Back up the ``scylla.yaml`` configuration file and the ScyllaDB packages
in case you need to rollback the upgrade.
..code::sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-src
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
@@ -153,18 +162,7 @@ Restore the configuration file
..code::sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src | /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for details.
..code::sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
@@ -34,7 +34,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/index.html>`_ for suspending Scylla Manager (only available Scylla Enterprise) scheduled or running repairs.
* Not to apply schema changes
..note:: Before upgrading, make sure to use the latest `Scylla Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
..note:: Before upgrading, make sure to use the latest `Scylla Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
This document is a step-by-step procedure for upgrading from |SCYLLA_NAME| |FROM| to |SCYLLA_NAME| |TO|, and rollback to 2021.1 if required.
Applicable Versions
------------------------
This guide covers upgrading |SCYLLA_NAME| from version |FROM| to version |TO| on |OS|. See :doc:`OS Support by Platform and Version </getting-started/os-support>` for information about supported versions.
Upgrade Procedure
----------------------------
..note::
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running the new version.
A ScyllaDB upgrade is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes in the cluster, you will:
* Drain node and backup the data.
* Check your current release.
* Backup configuration file.
* Stop ScyllaDB.
* Download and install new ScyllaDB packages.
* Start ScyllaDB.
* Validate that the upgrade was successful.
**During** the rolling upgrade it is highly recommended:
* Not to use new |TO| features.
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes.
* Not to apply schema changes.
Upgrade steps
-------------------------------
Drain node and backup the data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Before any major procedure, like an upgrade, it is recommended to backup all the data to an external device. In ScyllaDB, backup is done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following command:
..code::sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
Backup configuration file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..code::sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-5.x.z
Gracefully stop the node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..code::sh
sudo service scylla-server stop
Download and install the new release
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. You should use the same version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |FROM| version, stop right here! This guide only covers |FROM| to |TO| upgrades.
#. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the ScyllaDB version.
#. Check the scylla-server log (execute ``journalctl _COMM=scylla``) and ``/var/log/syslog`` to validate there are no errors.
#. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
Rollback Procedure
-----------------------
..include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from ScyllaDB release |TO| to |FROM|. Apply this procedure if an upgrade from |FROM| to |TO| failed before completing on all nodes. Use this procedure only for nodes you upgraded to |TO|.
ScyllaDB rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to |FROM|, you will:
* Drain the node and stop ScyllaDB.
* Downgrade to previous release.
* Restore the configuration file.
* Restart ScyllaDB.
* Validate the rollback success.
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
@@ -32,7 +32,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending ScyllaDB Manager (only available for ScyllaDB Enterprise) scheduled or running repairs.
* Not to apply schema changes
..note:: Before upgrading, make sure to use the latest `ScyllaDB Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
..note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============
@@ -60,9 +60,13 @@ When the upgrade is completed on all nodes, remove the snapshot with the ``nodet
Backup the configuration file
------------------------------
Back up the ``scylla.yaml`` configuration file and the ScyllaDB packages
in case you need to rollback the upgrade.
..code::sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-src
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup
sudo cp -a /etc/scylla/scylla.yaml.backup-src | /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for reference.
..code::sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
*:ref:`Upgrading ScyllaDB and simultaneously updating 3rd party and OS packages <upgrade-image-recommended-procedure>`. It is recommended if you are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04.
*:ref:`Upgrading ScyllaDB without updating any external packages <upgrade-image-upgrade-guide-regular-procedure>`.
.._upgrade-image-recommended-procedure:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
..versionadded:: 5.0
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_REPO|_ to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
* ``<version>`` - The ScyllaDB version to which you are upgrading ( |NEW_VERSION| ).
* ``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
@@ -139,72 +155,16 @@ Download and install the new release
sudo yum clean all
sudo yum update scylla\* -y
..group-tab:: EC2/GCP/Azure Ubuntu Image
..note::
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
If you are running a ScyllaDB official image (for EC2 AMI, GCP, or Azure),
you need to:
There are two alternative upgrade procedures:
*:ref:`Upgrading ScyllaDB and simultaneously updating 3rd party and OS packages <upgrade-image-recommended-procedure>`. It is recommended if you are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04.
*:ref:`Upgrading ScyllaDB without updating any external packages <upgrade-image-upgrade-guide-regular-procedure>`.
.._upgrade-image-recommended-procedure:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
*``<version>`` - The ScyllaDB version to which you are upgrading ( |NEW_VERSION| ).
*``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
#. Update the |SCYLLA_RPM_SRC_REPO|_ to |SRC_VERSION|.
#. Install:
.. code:: console
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\\*tools-core
sudo yum downgrade scylla\\* -y
sudo yum install scylla
..group-tab:: EC2/GCP/Azure Ubuntu Image
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
..code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer ‘y’ to the first two questions.
sudo yum downgrade scylla-\*cqlsh -y
sudo yum remove scylla-\*cqlsh -y
sudo yum downgrade scylla\* -y
sudo yum install scylla -y
Restore the configuration file
------------------------------
..code::sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src | /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for reference.
..code::sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
@@ -147,72 +163,16 @@ Download and install the new release
sudo yum clean all
sudo yum update scylla\* -y
..group-tab:: EC2/GCP/Azure Ubuntu Image
..note::
Before upgrading, check what version you are running now using ``dpkg -s scylla-server``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
If you are running a ScyllaDB official image (for EC2 AMI, GCP, or Azure),
you need to:
* Download and install the new ScyllaDB release for Ubuntu; see
the Debian/Ubuntu tab above for instructions.
* Update underlying OS packages.
There are two alternative upgrade procedures:
*:ref:`Upgrading ScyllaDB and simultaneously updating 3rd party and OS packages <upgrade-image-recommended-procedure>`. It is recommended if you are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04.
*:ref:`Upgrading ScyllaDB without updating any external packages <upgrade-image-upgrade-guide-regular-procedure>`.
.._upgrade-image-recommended-procedure:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
*``<version>`` - The ScyllaDB version to which you are upgrading ( |NEW_VERSION| ).
*``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
#. Update the |SCYLLA_RPM_SRC_REPO|_ to |SRC_VERSION|.
#. Install:
.. code:: console
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\\*tools-core
sudo yum downgrade scylla\\* -y
sudo yum install scylla
sudo yum downgrade scylla-\*cqlsh -y
sudo yum remove scylla-\*cqlsh -y
sudo yum downgrade scylla\* -y
sudo yum install scylla -y
..group-tab:: EC2/GCP/Azure Ubuntu Image
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
..code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer ‘y’ to the first two questions.
Restore the configuration file
------------------------------
..code::sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src | /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because |NEW_VERSION| uses a different set of system tables. See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>` for reference.
..code::sh
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
@@ -167,54 +167,27 @@ Download and install the new release
..group-tab:: EC2/GCP/Azure Ubuntu Image
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
There are two alternative upgrade procedures: upgrading ScyllaDB and simultaneously updating 3rd party and OS packages - recommended if you
are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04, and upgrading ScyllaDB without updating
any external packages.
If you’re using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions. If you’re using your
own image and have installed ScyllaDB packages for Ubuntu or Debian,
you need to apply an extended upgrade procedure:
#. Update the ScyllaDB deb repo (see above).
#. Configure Java 1.8 (see above).
#. Install the new ScyllaDB version with the additional
``scylla-enterprise-machine-image`` package:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
* ``<version>`` - The ScyllaDB Enterprise version to which you are upgrading ( |NEW_VERSION| ).
*``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB Enterprise packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
To upgrade ScyllaDB without updating any external packages, follow the :ref:`download and installation instructions for Debian/Ubuntu <upgrade-debian-ubuntu-5.2-to-enterprise-2023.1>`.
There are two entries with the same base partition key, but in different streams. One of them corresponds to the write made before the generation change, the other --- to the write made after the change.
After the operating CDC generation changes, all writes with timestamps greater than or equal to the new generation's timestamp will use the new stream IDs. If you try to perform a write with a timestamp that is smaller than the new generation's timestamp, the write may be rejected, depending on the node you're connected to:
After the operating CDC generation changes, all writes with timestamps greater than or equal to the new generation's timestamp will use the new stream IDs.
* if the clock of the node you're connected to reports earlier time than the generation's timestamp, it will allow the write to be performed.
* Otherwise, the write will be rejected.
Therefore, if you've configured the driver to generate timestamps for you, make sure that the clock of the machine your driver is running on is not too desynchronized with the clock of the node you're connecting to. That way you can minimize the chance of writes being rejected while a new node is being bootstrapped.
Example: rejecting writes to an old generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a continuation of the :ref:`previous example <next-gen>`; a second node was bootstrapped recently, thus a new generation superseded the previous one.
#. Get the timestamp of the latest generation as an integer:
.. code-block:: cql
SELECT tounixtimestamp(time) FROM system_distributed.cdc_generation_timestamps WHERE key = 'timestamps';
result:
.. code-block:: none
system.tounixtimestamp(time)
------------------------------
1585152329484
1585140283006
(2 rows)
Generation timestamps have millisecond resolution. Here, the latest generation's timestamp is equal to ``1585152329484`` milliseconds.
#. Try to perform a write with a slightly smaller timestamp (remember that the ``USING TIMESTAMP`` clause expects a timestamp in **microseconds**):
.. code-block:: cql
INSERT INTO ks.t (pk, ck, v) VALUES (0, 0, 0) USING TIMESTAMP 1585152329483000;
result:
.. code-block:: none
InvalidRequest: Error from server: code=2200 [Invalid query] message="cdc: attempted to get a stream from an earlier generation than the currently used one. With CDC you cannot send writes with timestamps too far into the past, because that would break consistency properties (write timestamp: 2020/03/25 16:05:29, current generation started at: 2020/03/25 16:05:29)"
The write was rejected.
#. Perform a write with a timestamp equal to the generation's timestamp:
.. code-block:: cql
INSERT INTO ks.t (pk, ck, v) VALUES (0, 0, 0) USING TIMESTAMP 1585152329484000;
The write succeeds.
If the clock of the node you're connected to reports time distant from the write's timestamp, it may reject the write. If you've configured the driver to generate timestamps for you, make sure that the clock of the machine your driver is running on is not too desynchronized with the clock of the node you're connecting to. That way you can minimize the chance of writes being rejected.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.