Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closes#19011
PR https://github.com/scylladb/scylladb/pull/17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).
The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.2.
Closes#18666
* github.com:scylladb/scylladb:
sstable_datafile_test: add testcase to test reclaim during reload
sstable_datafile_test: add test to verify auto reload of reclaimed components
sstables_manager: reload previously reclaimed components when memory is available
sstables_manager: start a fiber to reload components
sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
sstable_datafile_test: add test to verify reclaimed components reload
sstables: support reloading reclaimed components
sstables_manager: add new intrusive set to track the reclaimed sstables
sstable: add link and comparator class to support new instrusive set
sstable: renamed intrusive list link type
sstable: track memory reclaimed from components per sstable
sstable: rename local variable in sstable::total_reclaimable_memory_size
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.
Fixes: #18662
(cherry picked from commit f6511ca1b0)
(cherry picked from commit e7b07692b6)
(cherry picked from commit 78afb3644c)
Refs #18667Closes#18723
* github.com:scylladb/scylladb:
test/boost/mutation_fragment_test.cc: add test for validator validation levels
mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z'
it concerns any conversion except JSON timestamp format
JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and timezone specification
Fixes#14518Fixes#7997Closes#14726Fixes#16575
(cherry picked from commit ff721ec3e3)
Closes#18852
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.
(cherry picked from commit e7b07692b6)
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.
Backporting notes:
* Added const const mutation_fragment_stream_validating_filter&
param to on_validation_error()
* Made full_name() public
(cherry picked from commit f6511ca1b0)
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixes scylladb/scylladb#18635
- [x] Although the fixes is for a rare bug, it has very low risk and so it's worth backporting to all live versions
(cherry picked from commit 64c51cf32c)
(cherry picked from commit 88b3173d03)
(cherry picked from commit 4bbb66f805)
Refs #18636Closes#18680
* github.com:scylladb/scylladb:
chunked_vector_test: add more exception safety tests
chunked_vector_test: exception_safe_class: count also moved objects
utils: chunked_vector: fill ctor: make exception safe
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.
However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.
Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.
Fixesscylladb/scylladb#18635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0b061194a7)
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f758d7b114)
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.
The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 54bb03cff8)
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.
Fixes#18283
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18465
(cherry picked from commit d39adf6438)
Closes#18659
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2340ab63c6)
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3ef2f79d14)
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 02d272fdb3)
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a53af1f878)
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
(cherry picked from commit 8df6d10e88)
Closes#18558
The event is used in a loop.
Found by clang-tidy:
```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:39: note: move occurred here
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
listener->handle_stream_event(std::move(event));
^
```
Fixes#18332
(cherry picked from commit 4fd4e6acf3)
Closes#18430
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.
Fixes#18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18400
(cherry picked from commit 6af2659b57)
Closes#18437
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.
in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.
this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.
Refs 822a315dfaFixes#18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)
Closes#18373
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.
Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.
The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.
This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.
Fixes#18140Closesscylladb/scylladb#18141
(cherry picked from commit 642f9a1966)
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.
The feature can be manually verified by doing the following :
1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB
The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.
Fixes https://github.com/scylladb/scylladb/issues/17747
Backported from #17771 to 5.2.
Closes#18247
* github.com:scylladb/scylladb:
test_bloom_filter.py: disable reclaiming memory from components
sstable_datafile_test: add tests to verify auto reclamation of components
test/lib: allow overriding available memory via test_env_config
sstables_manager: support reclaiming memory from components
sstables_manager: store available memory size
sstables_manager: add variable to track component memory usage
db/config: add a new variable to limit memory used by table components
sstable_datafile_test: add testcase to verify reclamation from sstables
sstables: support reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.
Fixes#17747
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.
Modify what's counted to repair memory limit.
Fixes: https://github.com/scylladb/scylladb/issues/16710.
(cherry picked from commit a4dc6553ab)
(cherry picked from commit 51c09a84cc)
Refs https://github.com/scylladb/scylladb/pull/17785Closes#18237
* github.com:scylladb/scylladb:
test: add test for repair_row::size()
repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.
Include other repair_row's members' memory size in repair memory
limit.
(cherry picked from commit a4dc6553ab)
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
(cherry picked from commit f1cc6252fd)
before this change, `reclaim_timer::report()` calls
```c++
fmt::format(", at {}", current_backtrace())
```
which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.
in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.
Fixes#18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18100
(cherry picked from commit fcf7ca5675)
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
Fixes#17890
(cherry picked from commit 9789a3dc7c)
Closes#18104
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.
Fixes#17854
(cherry picked from commit efcb718)
Closes#17999
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.
Fixes#15902
(cherry picked from commit 8a5689e7a7)
(cherry picked from commit 1d33a68dd7)
Closes#17911
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.
After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes
This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.
The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.
The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.
Fixes#17364
(cherry picked from commit 5ab3586135)
Closes#17858
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)
test/boost/error_injection_test.cc(167): fatal error:
in "test_inject_future_disabled":
critical check wait_time > sleep_msec has failed [23839ns <= 10ms]
Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
Fixes#15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
Major compaction semantics is that all data of a table will be compacted
together, so user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that all data in maintenance set won't be included
for major, until they're promoted into main set by off-strategy.
So user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in maintenance
set will be made available by major. A similar approach is done for
data in memtable, so flush is performed before major starts.
The only exception will be data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization in comapaction manager of reshape jobs guarantee
correctness if there's an ongoing off-strategy on behalf of the
table.
Fixes#11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15792
(cherry picked from commit ea6c281b9f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#17901
This commit updates the Upgrade ScyllaDB Image page.
- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.
(cherry picked from commit fc90112b97)
Closes#17885
The test in its original form relies on the
`error_injections_at_startup` feature, which 5.2 doesn't have, so I
adapted the test to enable error injections after bootstrapping nodes in
the backport (9c44bbce67). That is however
incorrect, it's important for the injection to be enabled while the
nodes are booting, otherwise the test will be flaky, as we observed.
Details in scylladb/scylladb#17749.
Remove the test from 5.2 branch.
Fixesscylladb/scylladb#17749Closes#17750
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.2.
We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.
Closes#17640
* github.com:scylladb/scylladb:
migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
migration_manager: fix indentation after the previous patch.
messaging_service: process migration_request rpc on shard 0
migration_manager: take group0 lock during raft snapshot taking
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.
We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!
We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.
The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
Fixes#17119
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17172
(cherry picked from commit 21e7deafeb)
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```
Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.
Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.
Fixes scylladb/scylladb#16373
(cherry picked from commit a68701ed4f)
Closes#17703
Jumping to shard 0 during group 0 snapshot transfer is required because
we take group 0 lock, onyl available on shard 0. But outside of Raft
mode it only pessimizes performance unnecessarily, so don't do it.
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:
node1 node2
repair1_master (takes permit) repair1_follower (waits on permit)
repair2_master (waits for permit) repair2_follower (takes permit)
In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.
Refs: #17644Fixes: #17591Closes#17646
(cherry picked from commit c6e108a)
Backport notes:
The fix above does not apply to 5.2, because on 5.2 the reader is
created immediately when the repair-meta is created. So we don't need
the game with a fake inactive read, we can just pause the already
created reader in the repair-reader constructor.
Closes#17730
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.
This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.
Fixes#17625
(cherry picked from commit 7a7b8972e5)
Closes#17725
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
(cherry picked from commit 76f0d5e35b)
Closes#17694
Commit 0c376043eb added access to group0
semaphore which can be done on shard0 only. Unlike all other group0 rpcs
(that already always forwarded to shard0) migration_request does not
since it is an rpc that what reused from non raft days. The patch adds
the missing jump to shard0 before executing the rpc.
(cherry picked from commit 4a3c79625f)
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.
Fixesscylladb/scylladb#16821
(cherry picked from commit 0c376043eb)
Backport note: introduced missing
`raft_group0_client::hold_read_apply_mutex`
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.
To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.
Use the method in streaming and repair to continue them when
the table is dropped in the meantime.
Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.
Closes#17528
* github.com:scylladb/scylladb:
repair: handle no_such_column_family from remote node gracefully
test: test drop table on receiver side during streaming
streaming: fix indentation
streaming: handle no_such_column_family from remote node gracefully
repair: add methods to skip dropped table
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.
The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.
The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.
There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:
1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.
Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:
1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.
The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.
Refs: #17493Closesscylladb/scylladb#17499
* github.com:scylladb/scylladb:
version_generator: check that get_next_version is called on shard 0
misc_services: fix data race from bad usage of get_next_version
(cherry picked from commit fd32e2ee10)
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.
(cherry picked from commit cf36015591)
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.
Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.
(cherry picked from commit 219e1eda09)
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.
Add method which waits until all raft changes are applied and then
checks whether given table exists.
Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.
(cherry picked from commit 5202bb9d3c)
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
Fixes: https://github.com/scylladb/scylladb/issues/17232.
Closes#17532
* github.com:scylladb/scylladb:
test: add test to check if reader is closed
sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
(cherry picked from commit 5227336a32)
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.
This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.
The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.
Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.
Apart from this change, this PR adds tests for it and updates
the documentation.
This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.
Fixesscylladb/scylladb#7251Fixesscylladb/scylladb#15260Closesscylladb/scylladb#17134
* github.com:scylladb/scylladb:
docs: using-scylla: cdc: remove info about failing writes to old generations
docs: dev: cdc: document writing to previous CDC generations
test: add test_writes_to_previous_cdc_generations
cdc: generation: allow increasing generation_leeway through error injection
cdc: metadata: allow sending writes to the previous generations
(cherry picked from commit 9bb4482ad0)
Backport note: replaced `servers_add` with `server_add` loop in tests
replaced `error_injections_at_startup` (not implemented in 5.2) with
`enable_injection` post-boot
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incrmented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes#17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17164
(cherry picked from commit 14315fcbc3)
For gnutls 3.8.3.
Since Fedora 37 is end-of-life, pick the package from Fedora 38. libunistring
needs to be updated to satisfy the dependency solver.
Fixes#17285.
Closesscylladb/scylladb#17287
Signed-off-by: Avi Kivity <avi@scylladb.com>
Closes#17411
The currently used version of "rustix" depency had a minor
security vulnerability. This patch updates the corresponding
crate.
The update was performed using "cargo update" on "rustix"
package and version "0.36.17"
relevant package and the corresponding version.
Refs #15772Closes#17408
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.
Fixes: #14924Closes#14928
(cherry picked from commit b92d47362f)
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.
So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.
Fixes: #17241Closesscylladb/scylladb#17242
(cherry picked from commit f068d1a6fa)
* seastar 29badd99...ad0f2d5d (1):
> Merge "Slowdown IO scheduler based on dispatched/completed ratio" into branch-5.2
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.
The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.
Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.
Fixes#17035Closesscylladb/scylladb#17073
* github.com:scylladb/scylladb:
utils: replace assert() by on_internal_error()
utils: add on_internal_error with common logger
utils: add a timeuuid minimum, like we had maximum
test/cql-pytest: tests for "date" type
(cherry picked from commit 2a4b991772)
Backports required to fix scylladb/scylladb#16683 in 5.2:
- when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot.
Closes#17087
* github.com:scylladb/scylladb:
test_raft_snapshot_request: fix flakiness (again)
test_raft_snapshot_request: fix flakiness
Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
raft: server: add workaround for scylladb/scylladb#12972
raft: Store snapshot update and truncate log atomically
service: raft: force initial snapshot transfer in new cluster
raft_sys_table_storage: give initial snapshot a non zero value
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().
Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.
The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.
As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.
If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.
This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).
Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.
Fixes#16759Closesscylladb/scylladb#17138
(cherry picked from commit ed98102c45)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.
To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.
Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.
But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.
Closesscylladb/scylladb#17106
(cherry picked from commit c911bf1a33)
Add workaround for scylladb/python-driver#295.
Also an assert made at the end of the test was false, it is fixed with
appropriate comment added.
(cherry picked from commit 74bf60a8ca)
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.
Also add a test.
Fixesscylladb/scylladb#16683Closesscylladb/scylladb#17072
* github.com:scylladb/scylladb:
test: add test for fixing a broken group 0 snapshot
raft_group0: trigger snapshot if existing snapshot index is 0
(cherry picked from commit 181f68f248)
Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
The PR adds the API to `raft::server` and a HTTP endpoint that uses it.
In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)
Closesscylladb/scylladb#16816
* github.com:scylladb/scylladb:
raft: remove `empty()` from `fsm_output`
test: add test for manual triggering of Raft snapshots
api: add HTTP endpoint to trigger Raft snapshots
raft: server: add `trigger_snapshot` API
raft: server: track last persisted snapshot descriptor index
raft: server: framework for handling server requests
raft: server: inline `poll_fsm_output`
raft: server: fix indentation
raft: server: move `io_fiber`'s processing of `batch` to a separate function
raft: move `poll_output()` from `fsm` to `server`
raft: move `_sm_events` from `fsm` to `server`
raft: fsm: remove constructor used only in tests
raft: fsm: move trace message from `poll_output` to `has_output`
raft: fsm: extract `has_output()`
raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
raft: server: pass `*_aborted` to `set_exception` call
(cherry picked from commit d202d32f81)
Backport notes:
- `has_output()` has a smaller condition in the backported version
(because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.
In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec77c6 but
the fix was not backported due to many prerequisites and complexity.
Raft automatically recovers from interrupted snapshot transfer by
retrying it eventually, and everything works. However an ERROR is
reported due to that one failed snapshot transfer, and dtests dont like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs even if the test passed otherwise.
Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066Closes#14336
(cherry picked from commit ff386e7a44)
Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because of tombstones in current schema state expire hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over
latest schema state.
Ref: scylladb/scylladb#16683
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though since the state of group0 will not be re-created from the
log on restart because the entries will have outdated id and will be
skipped. As a result in memory state machine state will not be restored.
This is not a problem now since schema state it restored outside of raft
code.
Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070722)
Before returning task status, wait_task waits for it to finish with
done() method and calls get() on a resulting future.
If requested task fails, an exception will be thrown and user will
get internal server error instead of failed task status.
Result of done() method is ignored.
Fixes: #14914.
(cherry picked from commit ae67f5d47e)
Closes#16438
discard_result ignores only successful futures. Thus, if
perform_compaction<regular_compaction_task_executor> call fails,
a failure is considered abandoned, causing tests to fail.
Explicitly ignore failed future.
Fixes: #14971.
Closes#15000
(cherry picked from commit 7a28cc60ec)
Closes#16441
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).
This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.
This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.
Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.
In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.
Fixes: scylladb/scylladb#15622Closesscylladb/scylladb#15972
(cherry picked from commit f094e23d84)
When a base table changes and altered, so does the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns).
However the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.
Fixes https://github.com/scylladb/scylladb/issues/16392
This series also adds a test which fails without this fix and passes when the fix is applied.
Closesscylladb/scylladb#16897
* github.com:scylladb/scylladb:
Add test for mv prepared statements invalidation on base alter
query processor: treat view changes at least as table changes
(cherry picked from commit 5810396ba1)
On some environment such as VMware instance, /dev/disk/by-uuid/<UUID> is
not available, scylla_raid_setup will fail while mounting volume.
To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fallback to other paths like by-partuuid,
by-id, by-path or just using real device path like /dev/md0.
To get device path, and also to dumping device status when UUID is not
available, this will introduce UdevInfo class which communicate udev
using pyudev.
Related #11359Closesscylladb/scylladb#13803
(cherry picked from commit 58d94a54a3)
[syuu: renegerate tools/toolchain/image for new python3-pyudev package]
Closes#16938
When the reader is currently paused, it is resumed, fast-forwarded, then
paused again. The fast forwarding part can throw and this will lead to
destroying the reader without it being closed first.
Add a try-catch surrounding this part in the code. Also mark
`maybe_pause()` and `do_pause()` as noexcept, to make it clear why
that part doesn't need to be in the try-catch.
Fixes: #16606Closesscylladb/scylladb#16630
(cherry picked from commit 204d3284fa)
If std::vector is resized its iterators and references may
get invalidated. While task_manager::task::impl::_children's
iterators are avoided throughout the code, references to its
elements are being used.
Since children vector does not need random access to its
elements, change its type to std::list<foreign_task_ptr>, which
iterators and references aren't invalidated on element insertion.
Fixes: #16380.
Closesscylladb/scylladb#16381
(cherry picked from commit 9b9ea1193c)
Closes#16777
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: https://github.com/scylladb/scylladb/issues/14710.
Closes#16750
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Delete overriden create_table_statement::execute as it only calls its
direct parent's (schema_altering_statement) execute method anyway.
(cherry picked from commit 6c7eb7096e)
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
(cherry picked from commit 60fdc44bce)
Pass data_dictionary::database to check_restricted_table_properties
as an arguemnt instead of query_processor as the method will be called
from a context which does not have access to query processor.
(cherry picked from commit ec98b182c8)
TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows.
Turns out we had two problems in this area that leads to suboptimal bloom filters.
1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed.
2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count.
For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong.
Fixes https://github.com/scylladb/scylladb/issues/15704.
Closes scylladb/scylladb#15938
* github.com:scylladb/scylladb:
streaming: Improve partition estimation with TWCS
streaming: Don't adjust partition estimate if segregation is postponed
(cherry picked from commit 64d1d5cf62)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#16672
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
(cherry picked from commit 6ffb482bf3)
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
a mutation from the reader, uses consume(), which does not trigger
`set_close_is_required()` (#16520). Due to this, the top-level
combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
for the current range. If the circumstances are just right, all
underlying readers are closed, before the combined reader is
destoyed. Looks like this is what happens for the most time.
This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.
Fixes: #16519Closesscylladb/scylladb#16521
(cherry picked from commit da033343b7)
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.
Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.
Fixes: #15078
Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
(cherry picked from commit d1654ccdda)
Fixes#9405
`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.
To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.
Closes#14534
* github.com:scylladb/scylladb:
db: hints: add checksum to sync point encoding
db: hints: add the version_size constant
(cherry picked from commit eb6202ef9c)
The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.
Closes#16458
* tools/java e2aad6e3a0...d7ec9bf45f (1):
> Merge "build: take care of old libthrift" from Piotr Grabowski
Fixes: scylladb/scylla-tools-java#352
Closes#16464
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.
Closesscylladb/scylladb#15428
(cherry picked from commit 62a8a31be7)
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.
This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.
Fix the issue by using a exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with reconnection exception stack trace.
Fixes#15104Closes#15112
(cherry picked from commit 17e3e367ca)
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
(cherry picked from commit 55f047f33f)
Having values of the duration type is not allowed for clustering
columns, because duration can't be ordered. This is correctly validated
when creating a table but do not validated when we alter the type.
Fixes#12913Closesscylladb/scylladb#16022
(cherry picked from commit bd73536b33)
systemd man page says:
systemd-fstab-generator(3) automatically adds dependencies of type Before= to
all mount units that refer to local mount points for this target unit.
So "Before=local-fs.taget" is the correct dependency for local mount
points, but we currently specify "After=local-fs.target", it should be
fixed.
Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target",
since .mount are not related with multi-user but depends local
filesystems.
Fixes#8761Closesscylladb/scylladb#15647
(cherry picked from commit a23278308f)
Shard id is logged twice in repair (once explicitly, once added by logger).
Redundant occurrence is deleted.
shard_repair_task_impl::id (which contains global repair shard)
is renamed to avoid further confusion.
Fixes: https://github.com/scylladb/scylladb/issues/12955Closes#16439
* github.com:scylladb/scylladb:
repair: rename shard_repair_task_impl::id
repair: delete redundant shard id from logs
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.
The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.
Fixes#6389.
Closes#13150
* github.com:scylladb/scylladb:
test/alternator: test concurrent TagResource / UntagResource
db/tags: drop unsafe update_tags() utility function
alternator: isolate concurrent modification to tags
db/tags: add safe modify_tags() utility functions
migration_manager: expose access to storage_proxy
(cherry picked from commit dba1d36aa6)
Closes#16453
Backport the following patches to 5.2:
- gossiper: mark_alive: enter background_msg gate (#14791)
- gossiper: mark_alive: use deferred_action to unmark pending (#14839)
Closes#16452
* github.com:scylladb/scylladb:
gossiper: mark_alive: use deferred_action to unmark pending
gossiper: mark_alive: enter background_msg gate
The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column
of type "time" forgot to put the time string in quotes. The result was
invalid JSON. This is patch is a one-liner fixing this bug.
This patch also removes the "xfail" marker from one xfailing test
for this issue which now starts to pass. We also add a second test for
this issue - the existing test was for "SELECT TOJSON(t)", and the second
test shows that "SELECT JSON t" had exactly the same bug - and both are
fixed by the same patch.
We also had a test translated from Cassandra which exposed this bug,
but that test continues to fail because of other bugs, so we just
need to update the xfail string.
The patch also fixes one C++ test, test/boost/json_cql_query_test.cc,
which enshrined the *wrong* behavior - JSON output that isn't even
valid JSON - and had to be fixed. Unlike the Python tests, the C++ test
can't be run against Cassandra, and doesn't even run a JSON parser
on the output, which explains how it came to enshrine wrong output
instead of helping to discover the bug.
Fixes#7988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16121
(cherry picked from commit 8d040325ab)
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.
Fixes#14839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14840
(cherry picked from commit 1e7e2eeaee)
we compare the symbols lists of stripped ELF file ($orig.stripped) and
that of the one including debugging symbols ($orig.debug) to get a
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).
but we generate the symbol list of stripped ELF file using the
sysv format, while generate the one from the unstripped one using
posix format. the former is always padded the symbol names with spaces
so that their the length at least the same as the section name after
we split the fields with "|".
that's why the diff includes the stuff we don't expect. and hence,
we have tons of warnings like:
```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```
when using objcopy to filter the ELF file to keep only the
symbols we are interested in.
so, in this change
* use the same format when dumping the symbols from unstripped ELF
file
* include the symbols in the text area -- the code, by checking
"T" and "t" in the dumped symbols. this was achieved by matching
the lines with "FUNC" before this change.
* include the the symbols in .init data section -- the global
variables which are initialized at compile time. they could
be also interesting when debugging an application.
Fixes#15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15514
(cherry picked from commit 50c937439b)
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes#15632Closesscylladb/scylladb#15633
(cherry picked from commit 330d221deb)
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closesscylladb/scylladb#15499
(cherry picked from commit 75109e9519)
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is throw in place
of an `assert()`, which is supposed to verify the programming
invariants, not for identifying corrupted data file.
Fixes#15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15264
(cherry picked from commit 1ed894170c)
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:
```
WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```
It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.
And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.
Fixes#13538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14557
(cherry picked from commit edfb89ef65)
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.
so, let's make the percentage a float before formatting it.
Fixes#14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14588
(cherry picked from commit 1eb76d93b7)
shard_repair_task_impl::id stores global repair id. To avoid confusion
with the task id, the field is renamed to global_repair_id.
(cherry picked from commit d889a599e8)
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.
Fixes#15110
(cherry picked from commit 9fd3df13a2)
before this change, we format a sstring with "{:d}", fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.
so, in this change, we just print the int8_t using `seastar::format()`
in a single pass. and with the format specifier of `#02x` instead of
adding the "0x" prefix manually.
Fixes#14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14578
(cherry picked from commit 27d6ff36df)
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:
```
"regular_columns": [
"name",
"clustering_order",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"column_name_bytes",
"type_name",
"org.apache.cassandra.db.marshal.BytesType",
"name",
"kind",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"position",
"type_name",
"org.apache.cassandra.db.marshal.Int32Type",
"name",
"type",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type"
]
```
but according
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,
> $SERIALIZATION_HEADER_METADATA := {
> "min_timestamp_base": Uint64,
> "min_local_deletion_time_base": Uint64,
> "min_ttl_base": Uint64",
> "pk_type_name": String,
> "clustering_key_types_names": [String, ...],
> "static_columns": [$COLUMN_DESC, ...],
> "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
> "name": String,
> "type_name": String
> }
"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columnes". this schema makes sense,
as each column should be considered as a single object which
is composed of two properties. but we dump them like a list.
so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.
after the change, "regular_columns" are printed like:
```
"regular_columns": [
{
"name": "clustering_order",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "column_name_bytes",
"type_name": "org.apache.cassandra.db.marshal.BytesType"
},
{
"name": "kind",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "position",
"type_name": "org.apache.cassandra.db.marshal.Int32Type"
},
{
"name": "type",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
}
]
```
Fixes#15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15037
(cherry picked from commit c82f1d2f57)
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.
The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.
This fix should resolve the issue where forward_service retries could
block the shutdown.
Fixes#12604Closes#13922
(cherry picked from commit e0855b1de2)
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrary large.
Fixes: #12504
(cherry picked from commit 5adbb6cde2)
This is used for readiness API: /storage_service/rpc_server and the fix prevents from returning 'true' prematurely.
Some improvement for readiness was added in a51529dd15 but thrift implementation wasn't fully done.
Fixes https://github.com/scylladb/scylladb/issues/12376Closes#13319
* github.com:scylladb/scylladb:
thrift: return address in listen_addresses() only after server is ready
thrift: simplify do_start_server() with seastar:async
(cherry picked from commit 9a024f72c4)
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes#13657Closes#13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
(cherry picked from commit a29b8cd02b)
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.
\Fixes #13899
\Closes #13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
(cherry picked from commit d2d53fc1db)
Closes#16431
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
(cherry picked from commit 135b4fd434)
Currently, the reader might dereference a null pointer
if the input stream reaches eof prematurely,
and read_exactly returns an empty temporary_buffer.
Detect this condition before dereferencing the buffer
and sstables::malformed_sstable_exception.
Fixes#13599
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13600
(cherry picked from commit 77b70dbdb7)
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes#13807Closes#13804
(cherry picked from commit 6bc16047ba)
The metric shows the opposite of what its name suggests.
It shows available memory rather than consumed memory.
Fix that.
Fixes#13810Closes#13811
(cherry picked from commit 0813fa1da0)
The use statement execution code can throw if the keyspace is
doesn't exist, this can be a problem for code that will use
execute in a fiber since the exception will break the fiber even
if `then_wrapped` is used.
Fixes#14449
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closesscylladb/scylladb#14394
(cherry picked from commit c5956957f3)
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.
This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.
In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.
As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.
This series fixes the bug and adds a minimal reproducer test.
Fixes#10598Closesscylladb/scylladb#16003
* github.com:scylladb/scylladb:
mutation_query_test: test that range tombstones are sent in reverse queries
mutation_query: properly send range tombstones in reverse queries
(cherry picked from commit 65e42e4166)
When scanning our latest docker image using `trivy` (command: `trivy
image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS
packages which are out of date.
Also removing `openssh-server` and `openssh-client` since we don't use
it for our docker images
Fixes: https://github.com/scylladb/scylladb/issues/16222
Closes scylladb/scylladb#16224
(cherry picked from commit 7ce6962141)
Closes#16360
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
Fixes: #13540Closes#13541
(cherry picked from commit b790f14456)
Currently the cache updaters aren't exception safe
yet they are intended to be.
Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.
Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.
\Fixes scylladb/scylladb#15576
\Closes scylladb/scylladb#15577
* github.com:scylladb/scylladb:
row_cache: abort on exteral_updater::execute errors
row_cache: do_update: simplify _prev_snapshot_pos setup
(cherry picked from commit 4a0f16474f)
Closesscylladb/scylladb#16256
Fixes#16153
* java e716e1bd1d...80701efa8d (1):
> NodeProbe: allow addressing table name with colon in it
/home/nyh/scylla/tools$ git submodule summary jmx | cat
* jmx bc4f8ea...f21550e (3):
> ColumnFamilyStore: only quote table names if necessary
> APIBuilder: allow quoted scope names
> ColumnFamilyStore: don't fail if there is a table with ":" in its name
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#16296
This commit fixes the rollback procedure in
the 4.6-to-5.0 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16155
(cherry picked from commit 1e80bdb440)
This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4, branch-5.2, and branch-5.1
Closesscylladb/scylladb#16154
(cherry picked from commit 7ad0b92559)
This commit fixes the rollback procedure in
the 5.1-to-5.2 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
is fixed.
- The "Gracefully shutdown ScyllaDB" command
is fixed.
In addition, there are the following updates
to be in sync with the tests:
- The "Backup the configuration file" step is
extended to include a command to backup
the packages.
- The Rollback procedure is extended to restore
the backup packages.
- The Reinstallation section is fixed for RHEL.
Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.
Refs https://github.com/scylladb/scylladb/issues/11907
This commit must be backported to branch-5.4 and branch-5.2.
Closesscylladb/scylladb#16152
(cherry picked from commit 91cddb606f)
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth is timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13851
(cherry picked from commit a7ceb987f5)
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.
Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.
Fixesscylladb/scylladb#15211
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f9a7635390)
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.
The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.
The easy fix is to perform the only potentially-throwing step first.
Fixes#15822Closesscylladb/scylladb#15864
(cherry picked from commit 93ea3d41d8)
Update node_exporter to 1.7.0.
The previous version (1.6.1) was flagged by security scanners (such as
Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that
problem.
[Botond: regenerate frozen toolchain]
Fixes#16085Closesscylladb/scylladb#16086Closesscylladb/scylladb#16090
(cherry picked from commit 321459ec51)
[avi: regenerate frozen toolchain]
* tools/jmx 88d9bdc...bc4f8ea (1):
> Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski
* tools/java f8f556d802...e716e1bd1d (1):
> Merge 'build: update several dependencies' from Piotr Grabowski
Update build dependencies which were flagged by security scanners.
Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351
Closes#16150
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schemaFixes#15380Closes#15381
(cherry picked from commit c27d212f4b)
Currently, when said feature is enabled, we recalcuate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalulation. However this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidently, by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.
Fixes: #16004Closesscylladb/scylladb#16013
(cherry picked from commit 22381441b0)
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
(cherry picked from commit 48164e1d09)
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
(cherry picked from commit cf81eef370)
Currently the code will assert because cl pointer will be null and it
will be null because there is no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>
(cherry picked from commit 941407b905)
Backport needed by #4485.
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.
To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.
Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.
I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.
Fixes: #15485Closesscylladb/scylladb#16019
(cherry picked from commit dfd7981fa7)
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".
Fixes#16040Closesscylladb/scylladb#16041
(cherry picked from commit 338a9492c9)
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.
This fix affects also alternator's secondary indexes.
Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844Closesscylladb/scylladb#15845
(cherry picked from commit 020a9c931b)
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.
Fixes: #14330Closesscylladb/scylladb#15879
(cherry picked from commit 90a8489809)
While looking for specific UDF/UDA, result of
`functions::functions::find()` needs to be filtered out based on
function's type.
Fixes: #14360
(cherry picked from commit d498451cdf)
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be easily and
without risk be backported to older versions; this is the fix.
Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.
Fixes: scylladb/scylladb#15816Closesscylladb/scylladb#15970
* github.com:scylladb/scylladb:
test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
api: failure_detector: fix indentation
api: failure_detector: invoke on shard 0
(cherry picked from commit 9443253f3d)
This `with` context is supposed to disable, then re-enable
autocompaction for the given keyspaces, but it used the wrong API for
it, it used the column_family/autocompaction API, which operates on
column families, not keyspaces. This oversight led to a silent failure
because the code didn't check the result of the request.
Both are fixed in this patch:
* switch to use `storage_service/auto_compaction/{keyspace}` endpoint
* check the result of the API calls and report errors as exceptions
Fixes: #13553Closes#13568
(cherry picked from commit 66ee73641e)
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.
In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.
Release shard_repair_task_impl resources immediately after shard
repair is finished.
Fixes: #15505.
(cherry picked from commit 0474e150a9)
Closes#15875
scylla-sstable currently has two ways to obtain the schema:
* via a `schema.cql` file.
* load schema definition from memory (only works for system tables).
This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.
This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.
Example:
```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```
As seen above, subdirectories like qurantine, staging etc are also supported.
Fixes: https://github.com/scylladb/scylladb/issues/10126Closes#13448
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add tests for schema loading
test/cql-pytest: add no_autocompaction_context
docs: scylla-sstable.rst: remove accidentally added copy-pasta
docs: scylla-sstable.rst: remove paragraph with schema limitations
docs: scylla-sstable.rst: update schema section
test/cql-pytest: nodetool.py: add flush_keyspace()
tools/scylla-sstable: reform schema loading mechanism
tools/schema_loader: add load_schema_from_schema_tables()
db/schema_tables: expose types schema
(cherry picked from commit 952b455310)
Closes#15386
This is a backport of https://github.com/scylladb/scylladb/pull/14158 to branch 5.2
Closes#15872
* github.com:scylladb/scylladb:
migration_notifier: get schema_ptr by value
migration_manager: propagate listener notification exceptions
storage_service: keyspace_changed: execute only on shard 0
database: modify_keyspace_on_all_shards: execute func first on shard 0
database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
database: add modify_keyspace_on_all_shards
schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
database: create_keyspace_on_all_shards
database: update_keyspace_on_all_shards
database: drop_keyspace_on_all_shards
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**
Closes#15793
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
replica: Make compaction_group responsible for deleting off-strategy compaction input
To prevent use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097
where a temp schema_ptr retrieved from a global_schema_ptr
get destroyed when the notification function yielded.
Capturing the schema_ptr on the coroutine frame
is inexpensive since its a shared ptr and it makes sure
that the schema remains valid throughput the coroutine
life time.
\Fixes scylladb/scylladb#15097
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
\Closes #15098
(cherry picked from commit 0f54e24519)
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards. Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.
We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```
The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
Refs #3389
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 825d617a53)
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.
Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a690f0e81f)
When creating or altering a keyspace, we create a new
effective_replication_map instance.
It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate to new e_r_m (and reach
the same result). When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 13dd92e618)
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.
\Fixes #13137
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ba15786059)
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3b8c913e61)
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystem
implements the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data lost if the application / machine /
network breaks when the data is considerered persisted but they are
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
Fixes#15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15134
(cherry picked from commit d2d1141188)
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes#14705.
Closes#14707
(cherry picked from commit e00811caac)
Prevent div-by-zero byt returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It's not a problem until now
where we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes#14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting regular_compaction, which implement
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.
Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
sstables::offstrategy offstrategy)
So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.
Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13432
(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Cherry picked from commit 38b226f997
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes#14091.
Closes#13908
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15744
This is a backport of PR https://github.com/scylladb/scylladb/pull/15740.
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to.-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides.
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
Closes#15768
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides.
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
(cherry picked from commit dd1207cabb)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 526d543b95)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit 9852130c5b)
The estimated_partitions is estimated after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
write which is not correct.
To fix, use the updated estimated_partitions.
Reported by Petr Gusev
Closes#14179Fixes#15748
(cherry picked from commit 4592bbe182)
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.1-to.-5.2
upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
page.
- Remove outdated info (about previous releases)
from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
in the page tree.
Refs: https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit f6767f6d6e)
Closes#15754
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, inactive read registration path.
1) service level drop invoke stop of reader concurrency semaphore, which will
wait for in flight requests
2) turns out it stops first the gate used for closing readers that will
become inactive.
3) proceeds to wait for in-flight reads by closing the reader permit gate.
4) one of evictable reads take the inactive read registration path, and
finds the gate for closing readers closed.
5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.
Fixes#15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15535
(cherry picked from commit 914cbc11cf)
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes#12261Fixes#15088
Also, add rest_api unit test for those apis
Closes#15093
* github.com:scylladb/scylladb:
test: rest_api: add test_gossiper
gossiper: add get_unreachable_members_synchronized
(cherry picked from commit 57deeb5d39)
Backport note: `gossiper::lock_endpoint_update_semaphore` helper
function was missing, replaced with
`get_units(g._endpoint_update_semaphore, 1)`
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.
A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.
Fixes#14257
Refs #14357Closes#14329
(cherry picked from commit b9c2b326bc)
Backport note: replaced `gms::generation_type` with `int64_t` because
the branch is missing the refactor which introduced `generation_type`
(7f04d8231d)
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:
The query has a IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
CVE-2023-33972
Fixes#15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
(cherry picked from commit ab6988c52f)
Fixes https://github.com/scylladb/scylladb/issues/14490
This commit fixes mulitple links that were broken
after the documentation is published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.
Closes#14664
(cherry picked from commit a93fd2b162)
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.
This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.
Closes#15415
(cherry picked from commit fb635dccaa)
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.
Fixes#14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14615
(cherry picked from commit 3b1829f0d8)
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
\Closes #14785
(cherry picked from commit cbc97b41d4)
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes#14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14496
(cherry picked from commit 599636b307)
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes#15198Closes#15199
(cherry picked from commit 2000a09859)
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.
Move lock to the callers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13240
(cherry picked from commit da00052ad8)
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.
Closes#12540
* github.com:scylladb/scylladb:
api: gossiper: fix alive nodes
gms, service: lock live endpoint copy
gms, service: live endpoint copy method
(cherry picked from commit b919373cce)
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
Fixes#15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)
Closes#15150
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.
The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.
The patch just moves the index creation statements
from the second part to the first.
Fixes: #14076Closes#14090
(cherry picked from commit 0415ac3d5f)
Closes#15101
This mini-series backports the fix for #12010 along with low-risk patches it depends on.
Fixes: #12010Closes#15137
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
utils: Restore indentation after previous patch
utils: Coroutinize verify_owner_and_mode()
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes#12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow an exception in perfrom task for reshape compaction.
Fixes: #15058.
(cherry picked from commit e0ce711e4f)
Closes#15122
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes#14963Closes#14964
(cherry picked from commit e13a2b687d)
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.
When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.
Fixes: #13045Closes#13068
* github.com:scylladb/scylladb:
repair: fix indentation
repair: continue user requested repair if no_such_column_family is thrown
repair: add find_column_family_if_exists function
(cherry picked from commit 9859bae54f)
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes#14501Fixes#14742Closes#14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
(cherry picked from commit a637ddd09c)
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.
This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.
All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:
Already known issues:
Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k
New issues:
Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12816
(cherry picked from commit 328cdb2124)
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14528
(cherry picked from commit f08bc83cb2)
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:
1. A new test file, test_cast_data.py, for testing data casts (a
CAST (...) as ... in a SELECT), starting with testing casts from
varint to other types.
The test uncovers a lot of interesting cases (it is heavily
commented to explain these cases) but nothing there is wrong
and all tests pass on Scylla.
2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
that this caused #13551. In Cassandra and older Scylla, the sum
returned a NaN. In Scylla today, it generates a misleading
error message.
As usual, the tests were run on both Cassandra (4.1.1) and Scylla.
Refs #13551.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
This is a backport of bcb1d7c to branch-5.2.
Refs: #14184Closes#14893
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.
Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.
Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050Closes: #14770Closes#14736
* github.com:scylladb/scylladb:
querier_cache: add stats of scheduling group mismatches
querier_cache: check semaphore mismatch during querier lookup
querier_cache: add reference to `replica::database::is_user_semaphore()`
replica:database: add method to determine if semaphore is user one
(cherry picked from commit a8feb7428d)
before this change, there are chances that the temporary sstables
created for collecting the GC-able data create by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.
in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have good chance to
overlap the latter. these GC-only SSTables are assigned with
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.
so, in this change,
* to prevent sstables in the transient stage from being picked
up by regular compactions, a new interface class is introduced
so that the sstable is always added to registration before
it is added to sstable set, and removed from registration after
it is removed from sstable set. the struct helps to consolidate
the regitration related logic in a single place, and helps to
make it more obvious that the timespan of an sstable in
the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
overlap with the output sstable run. the run_id for the
gc sstable run is created only when the gc sstable writer
is created. because the gc sstables is not always created
for all compactions.
please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` passes a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.
Fixes#14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14725
(cherry picked from commit fdf61d2f7c)
Closes#14827
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
(cherry picked from commit 5b167a4ad7)
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
Fixes: #14819Closes#14821
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
(cherry picked from commit 056d04954c)
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.
This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```
Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved.
We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different.
We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.
A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.
Fixes: https://github.com/scylladb/scylladb/issues/13129Closes#14429
* github.com:scylladb/scylladb:
modification_statement: reject conditional statements with empty clustering key
statements/cas_request: fix crash on empty clustering range in LWT
(cherry picked from commit 49c8c06b1b)
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`
cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.
A reproducer was added.
Fixes#14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14818
(cherry picked from commit 050ce9ef1d)
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.
Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.
Fix it by using microseconds everywhere.
Fixes#14653Closes#14670
(cherry picked from commit 9d4b3c6036)
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.
Fixes#14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14813
(cherry picked from commit 0ac43ea877)
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
With this improvement, view building was measured to be 3x faster.
from
INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s
to
INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s
Refs #14089.
Fixes#14244.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1d8cb32a5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14764
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.
Also, unmask is unnecessary for enabling the timer.
Fixes#14249Closes#14252
(cherry picked from commit c70a9cbffe)
Closes#14421
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.
Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.
Fixes#14696Closes#14697
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes#14676Closes#14677
(cherry picked from commit 1b577e0414)
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.
Refs #7926Fixes#13689Closes#12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
(cherry picked from commit d2e089777b)
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
Fixes: #13841Fixes: #12552Closes#13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
(cherry picked from commit a7c2c9f92b)
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.
\Fixes scylladb/scylladb#14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
\Closes scylladb/scylladb#14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
(cherry picked from commit 87b4606cd6)
Closes#14649
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14636
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
Backported from c25201c1a3. `view_build_test.cc` needed some tiny adjustments for the backport.
Closes#14619Fixes#14503
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip to populate /dev/disk/by-* since it
detected two filesystem signatures, because mkfs.xxx did not cleared previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple time on same disks).
Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.
Fixes#13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#13738
(cherry picked from commit fdceda20cc)
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
Fixes: scylladb/scylladb#14415
Closes#14452
(cherry picked from commit ee9bfb583c)
Closes#14605
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes#14503
There was a bug that caused aggregates to fail when
used on column-sensitive columns.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes#14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
(cherry picked from commit 8a7261fd70)
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13907
(cherry picked from commit 5544d12f18)
Fixesscylladb/scylladb#14071
Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already.
Closes#12984
(cherry picked from commit bd7caefccf)
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes#14284Closes#14300
(cherry picked from commit 732feca115)
Backport note: the original `assert` was never added to branch-5.2, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.
Fixes#13543
(cherry picked from commit a45e0765e4)
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046Closes#14373
(cherry picked from commit f4ae2c095b)
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes#13491Closes#14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
(cherry picked from commit 586102b42e)
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.
Closes#14372
(cherry picked from commit 74fc69c825)
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.
(cherry picked from commit f60f89df17)
Closes#14370
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes#14118
(cherry picked from commit b7022cd74e)
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes#14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14038
(cherry picked from commit 23443e0574)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14193
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes#13933
(cherry picked from commit 64dc76db55)
Backport note: skipped the test_snapshot.py change, as the test doesn't
exist on this branch.
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
Fixes#13788Closes#13789
(cherry picked from commit 3f3dcf451b)
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times, it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.
Fixes#13518 (flaky topology test).
Closes#13756
(cherry picked from commit e7c9ca560b)
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.
Closes#13427
* github.com:scylladb/scylladb:
pylib_test: add tests for read_last_line
pytest: add pylib_test directory
scylla_cluster.py: fix read_last_line
scylla_cluster.py: move read_last_line to util.py
(cherry picked from commit 70f2b09397)
server to see other servers after start/restart
When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.
Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).
Fixes#13147
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13438
(cherry picked from commit 11561a73cb)
Backport note: skipped the test_mutation_schema_change.py fix as the
test doesn't exist on this branch.
Helper to get list of gossiper alive endpoints from REST API.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 62a945ccd5)
For most tests there will be nodes down, increase replication factor to
3 to avoid having problems for partitions belonging to down nodes.
Use replication factor 1 for raft upgrade tests.
(cherry picked from commit 08d754e13f)
Make replication factor configurable for the RandomTables helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 3508a4e41e)
There are two occasions in scylla_cluster
where we read the node logs, and in both of
them we read the entire file in memory.
This is not efficient and may cause an OOM.
In the first case we need the last line of the
log file, so we seek at the end and move backwards
looking for a new line symbol.
In the second case we look through the
log file to find the expected_error.
The readlines() method returns a Python
list object, which means it reads the entire
file in memory. It's sufficient to just remove
it since iterating over the file instance
already yields lines lazily one by one.
This is a follow-up for #13134.
Closes#13399
(cherry picked from commit 09636b20f3)
When adding extra columns in a test, make them value column. Name them
with the "v_" prefix and use the value column number counter.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13271
(cherry picked from commit 81b40c10de)
Sometimes when creating a node it's useful
to just install it and not start. For example,
we may want to try to start it later with
expected error.
The ScyllaServer.install method has been made
exception safe, if an exception occurs, it
reverts to the original state. This allows
to not duplicate the try/except logic
in two of its call sites.
(cherry picked from commit e407956e9f)
We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.
(cherry picked from commit 794d0e4000)
Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.
(cherry picked from commit c1d0ee2bce)
Extract the function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement expected_error feature.
(cherry picked from commit a4411e9ec4)
The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.
(cherry picked from commit 21b505e67c)
Instead of decommission of initial cluster, use custom cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13589
(cherry picked from commit ce87aedd30)
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.
Add a proof-of-concept test to be removed later.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13342
(cherry picked from commit e3b462507d)
Move long running topology tests out of `test_topology.py` and into their own files, so they can be run in parallel.
While there, merge simple schema tests.
Closes#12804
* github.com:scylladb/scylladb:
test/topology: rename topology test file
test/topology: lint and type for topology tests
test/topology: move topology ip tests to own file
test/topology: move topology test remove garbaje...
test/topology: move topology rejoin test to own file
test/topology: merge topology schema tests and...
test/topology: isolate topology smp params test
test/topology: move topology helpers to common file
(cherry picked from commit a24600a662)
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.
The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
Closes#12765
* github.com:scylladb/scylladb:
test/pylib: use larger timeout for decommission/removenode
test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
(cherry picked from commit e55f475db1)
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.
Fix tests using the previous helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 9ceb6aba81)
There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.
Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.
Closes#12686
(cherry picked from commit 40142a51d0)
After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.
Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.
Fixes: #12153Closes#12619
(cherry picked from commit fa9cf81af2)
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.
Fixes#12218Closes#12630
* github.com:scylladb/scylladb:
pytest: test start after ungraceful stop
test.py: enable force_schema_commit_log
(cherry picked from commit 5eadea301e)
Improve logging by printing the cluster at the end of each test.
Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.
Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.
Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.
Closes#12652
* github.com:scylladb/scylladb:
test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
test/topology: don't drop random_tables keyspace after a failed test
test/pylib: mark cluster as dirty after a failed test
test: pylib, topology: don't perform operations after test on a dirty cluster
test/pylib: print cluster at the end of test
(cherry picked from commit 2653865b34)
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784Closes#13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
(cherry picked from commit 1c0e8c25ca)
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.
Fixes#14139Closes#14143
(cherry picked from commit 1a521172ec)
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes#13545.
Closes#14134
(cherry picked from commit f51312e580)
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes#13917
(cherry picked from commit 6f4a68175b)
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes#12798)
Closes#13531
* github.com:scylladb/scylladb:
Merge 'Do not mask node operation errors' from Benny Halevy
Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
storage_service: Wait for normal state handler to finish in replace
storage_service: Wait for normal state handler to finish in bootstrap
storage_service: Send heartbeat earlier for node ops
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1
Closes: #13858
(cherry picked from commit 84ed95f86f)
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes https://github.com/scylladb/scylladb/issues/12462Closes#13894
* github.com:scylladb/scylladb:
tests: row_cache: Add reproducer for reader producing missing closing range tombstone
range_tombstone_change_generator: fix an edge case in flush()
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);
Fixes#13394.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.
static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
Fixes#13395.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
Fixes#13396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
Variant used by
streaming/stream_transfer_task.cc: , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))
as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.
static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());
Fixes#13397.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
Adds a reproducer for #12462.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the fix range_tombstone_change_generator::flush()
was used with end_of_range=true to produce the closing range_tombstone_change
and it did not handle correctly the case when there are two adjacent range
tombstones and flush(pos, end_of_range=true) is called such that pos is the
boundary between the two.
Cherry-picked from a717c803c7.
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes#12462
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes#13572Closes#13800
(cherry picked from commit 7fcc403122)
This is not really an error, so print it in debug log_level
rather than error log_level.
Fixes#13374
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13462
(cherry picked from commit cc42f00232)
Courtersy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB, as for it to happen, depends on evaluation order.
We haven't hit it yet as clang is left-to-right.
Fixes#13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13401
(cherry picked from commit d2d151ae5b)
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.
Fixes: #12939Closes#12963
* github.com:scylladb/scylladb:
test:boost: counter column parallelized aggregation test
service:forward_service: use long type when column is counter
(cherry picked from commit 61e67b865a)
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126Closes#13677
Fixes: https://github.com/scylladb/scylla-enterprise/issues/2852
(cherry picked from commit a8040306bb)
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is garanteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491Closes#13563
(cherry picked from commit 72003dc35c)
I've no idea why the quotes are there at all, it works even without
them. However, with quotes gdb-13 fails to find the _all_threads static
thread-local variable _unless_ it's printed with gdb "p" command
beforehand.
fixes: #13125
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13132
(cherry picked from commit 537510f7d2)
This series handles errors when aborting node operations and prints them rather letting them leak and be exposed to the user.
Also, cleanup the node_ops logging formats when aborting different node ops
and add more error logging around errors in the "worker" nodes.
Closes#12799
* github.com:scylladb/scylladb:
storage_service: node_ops_signal_abort: print a warning when signaling abort
storage_service: s/node_ops_singal_abort/node_ops_signal_abort/
storage_service: node_ops_abort: add log messages
storage_service: wire node_ops_ctl for node operations
storage_service: add node_ops_ctl class to formalize all node_ops flow
repair: node_ops_cmd_request: add print function
repair: do_decommission_removenode_with_repair: log ignore_nodes
repair: replace_with_repair: get ignore_nodes as unordered_set
gossiper: get_generation_for_nodes: get nodes as unordered_set
storage_service: don't let node_ops abort failures mask the real error
(cherry picked from commit 6373452b31)
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.
If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.
It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.
It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.
There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.
We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.
The race can still happen, but the time window is now much smaller.
The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.
Fixes#12989
Refs #12969Closes#13028
* github.com:scylladb/scylladb:
storage_service: node ops: Extract node_ops_insert() to reduce code duplication
storage_service: Make node operations safer by detecting asymmetric abort
storage_service: node ops: Add error injections
service: node_ops: Make watchdog and heartbeat intervals configurable
(cherry picked from commit 2b44631ded)
Similar to "storage_service: Wait for normal state handler to finish in
bootstrap", this patch enables the check on the replace procedure.
(cherry picked from commit 5856e69462)
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.
Consider this:
- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
bootstrap
- n2 fails to bootstrap
For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.
repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})
This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.
Fixes#12764Fixes#12956
(cherry picked from commit 53636167ca)
Node ops has the following procedure:
1 for node in sync_nodes
send prepare cmd to node
2 for node in sync_nodes
send heartbeat cmd to node
If any of the prepare cmd in step 1 takes longer than the heartbeat
watchdog timeout, the heartbeat in step 2 will be too late to update the
watchdog, as a result the watchdog will abort the operation.
To prevent slow prepare cmd kills the node operations, we can start the
heartbeat earlier in the procedure.
Fixes#11011Fixes#12969Closes#12980
(cherry picked from commit ba919aa88a)
Cranelift-codegen 0.92.0 and wasmtime 5.0.0 have security issues
potentially allowing malicious UDFs to read some memory outside
the wasm sandbox. This patch updates them to versions 0.92.1
and 5.0.1 respectively, where the issues are fixed.
Fixes#13157Closes#13171
(cherry picked from commit aad2afd417)
Wasmtime added some improvements in recent releases - particularly,
two security issues were patched in version 2.0.2. There were no
breaking changes for our use other than the strategy of returning
Traps - all of them are now anyhow::Errors instead, but we can
still downcast to them, and read the corresponding error message.
The cxx, anyhow and futures dependency versions now match the
versions saved in the Cargo.lock.
Closes#12830
(cherry picked from commit 8b756cb73f)
Ref #13157
Undefined behavior because the evaluation order is undefined.
With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().
The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for row, which access schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.
Fixes#13093.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13092
(cherry picked from commit 3fae46203d)
Fixes https://github.com/scylladb/scylladb/issues/13106
This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.
Closes#13316
(cherry picked from commit 1cfea1f13c)
* tools/python3 279b6c1...cf7030a (1):
> dist: redhat: provide only a single version
s/%{version}/%{version}-%{release}/ in `Requires:` sections.
this enforces the runtime dependencies of exactly the same
releases between scylla packages.
Fixes#13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second which we
considered more than enough - and indeed typically even 1ms is enough!
but very rarely (only know of only one occurance, in issue #13223) one
second is not enough.
Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: The tests starts with a 0.01 second
duration, and is then retried with increasing durations until it succeeds
or a 5-seconds duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of 1 seconds which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second, now it takes 0.07 seconds on a dev build on my laptop).
A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.
Fixes#13223.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13238
(cherry picked from commit c550e681d7)
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.
The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.
Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.
Fixes#13239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13291
(cherry picked from commit 4fdcee8415)
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if happens often.
Also, it spams the logs if --logger-log-level exception=trace is enabled.
Avoid by swallowing the exception on cancellation.
Fixes#13278.
Closes#13279
(cherry picked from commit 99cb948eac)
before this change, we use `round(random.random(), 5)` for
the value of `bloom_filter_fp_chance` config option. there are
chances that this expression could return a number lower or equal
to 6.71e-05.
but we do have a minimal for this option, which is defined by
`utils::bloom_calculations::probs`. and the minimal false positive
rate is 6.71e-05.
we are observing test failures where the we are using 0 for
the option, and scylla right rejected it with the error message of
```
bloom_filter_fp_chance must be larger than 6.71e-05 and less than or equal to 1.0 (got 0)
```.
so, in this change, to address the test failure, we always use a number
slightly greater or equal to a number slightly greater to the minimum to
ensure that the randomly picked number is in the range of supported
false positive rate.
Fixes#13313
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13314
(cherry picked from commit 33f4012eeb)
Otherwise the null pointer is dereferenced.
Add a unit test reproducing the issue
and testing this fix.
Fixes#13636
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
The tombstone_gc was documented as experimental in version 5.0.
It is no longer experimental in version 5.2.
This commit updates the information about the option.
Closes#13469
(cherry picked from commit a68b976c91)
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.
There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).
Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).
Fixes#13594Closes#13604
* github.com:scylladb/scylladb:
db: system_keyspace: use microsecond resolution for group0_history range tombstone
utils: UUID_gen: accept decimicroseconds in min_time_UUID
(cherry picked from commit 10c1f1dc80)
This patch backports https://github.com/scylladb/scylladb/pull/12710 to branch-5.2. To resolve the conflicts that it's causing, it also includes
* https://github.com/scylladb/scylladb/pull/12680
* https://github.com/scylladb/scylladb/pull/12681Closes#13542
* github.com:scylladb/scylladb:
uda: change the UDF used in a UDA if it's replaced
functions: add helper same_signature method
uda: return aggregate functions as shared pointers
udf: also check reducefunc to confirm that a UDF is not used in a UDA
udf: fix dropping UDFs that share names with other UDFs used in UDAs
pytest: add optional argument for new_function argument types
udt: disallow dropping a user type used in a user function
A problem in compaction reevaluation can cause the SSTable set to be left uncompacted for indefinite amount of time, potentially causing space and read amplification to be suboptimal.
Two revaluation problems are being fixed, one after off-strategy compaction ended, and another in compaction manager which intends to periodically reevaluate a need for compaction.
Fixes https://github.com/scylladb/scylladb/issues/13429.
Fixes https://github.com/scylladb/scylladb/issues/13430.
Closes#13431
* github.com:scylladb/scylladb:
compaction: Make compaction reevaluation actually periodic
replica: Reevaluate regular compaction on off-strategy completion
(cherry picked from commit 9a02315c6b)
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retreiving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partitions is interrputed
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engagned optional from `detach_state()`, because there is not state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, but overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_partition::no` and `detach_state()` will return a disengaged
optional as it should in this case.
Fixes: #12629Closes#13365
(cherry picked from commit bae62f899d)
Currently, if a UDA uses a UDF that's being replaced,
the UDA will still keep using the old UDF until the
node is restarted.
This patch fixes this behavior by checking all UDAs
when replacing a UDF and updating them if necessary.
Fixes#12709
(cherry picked from commit 02bfac0c66)
When deciding whether two functions have the same
signature, we have to check if they have the same name
and parameter types. Additionally, if they're represented
by pointers, we need to check if any of them is a nullptr.
This logic is used multiple times, so it's extracted to
a separate function.
To use this function, the `used_by_user_aggregate` method
takes now a function instead of name and types list - we
can do it because we always use it with an existing user
function (that we're trying to drop).
The method will also be useful when we'll be not dropping,
but replacing a user function.
(cherry picked from commit 58987215dc)
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.
(cherry picked from commit 20069372e7)
When dropping a UDF we're checking if it's not begin used in any UDAs
and fail otherwise. However, we're only checking its state function
and final function, and it may also be used as its reduce function.
This patch adds the missing checks and a test for them.
(cherry picked from commit ef1dac813b)
Currently, when dropping a function, we only check if there exist
an aggregate that uses a function with the same name as its state
function or final function. This may cause the drop to fail even
when it's just another UDF with the same name that's used in the
aggregate, even when the actual dropped function is not used there.
This patch fixes this by checking whether not only the name of the
UDA's sfunc and finalfunc, but also their argument types.
(cherry picked from commit 49077dd144)
When multiple functions with the same name but different argument types
are created, the default drop statement for these functions will fail
because it does not include the argument types.
With this change, this problem can be worked around by specifying
argument types when creating the function, as this will cause the drop
statement to include them.
(cherry picked from commit 8791b0faf5)
Currently, nothing prevents us from dropping a user type
used in a user function, even though doing so may make us
unable to use the function correctly.
This patch prevents this behavior by checking all function
argument and return types when executing a drop type statement
and preventing it from completing if the type is referenced
by any of them.
(cherry picked from commit 86c61828e6)
Related: https://github.com/scylladb/scylla-enterprise/issues/2794
This commit adds the information about the metric changes
in version 2023.1 compared to version 5.2.
This commit is part of the 5.2-to-2023.1 upgrade guide and
must be backported to branch-5.2.
Closes#13506
(cherry picked from commit 989a75b2f7)
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.
Fixes: https://github.com/scylladb/scylladb/issues/11803Closes#13515
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: don't evict inactive readers needlessly
reader_concurrency_semaphore: add stats to record reason for queueing permits
reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.
By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).
So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass and the second test passes this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".
The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.
Fixes#6447
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13079
(cherry picked from commit a4a318f394)
This patch fixes#12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).
The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).
The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.
The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":
Refs #12477: Combining COUNT with GROUP by results with empty results
in Cassandra, and one result with empty count in Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12715
(cherry picked from commit 3ba011c2be)
total disk space used metric is incorrectly telling the amount of
disk space ever used, which is wrong. It should tell the size of
all sstables being used + the ones waiting to be deleted.
live disk space used, by this defition, shouldn't account the
ones waiting to be deleted.
and live sstable count, shouldn't account sstables waiting to
be deleted.
Fix all that.
Fixes#12717.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.
This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.
Refs #10337
Refs #10817
Refs #10836
Refs #10837Fixes#12724
(cherry picked from commit 9e57b21e0c)
There was a missing check in validation of named
bind markers.
Let's say that a user prepares a query like:
```cql
INSERT INTO ks.tab (pk, ck, v) VALUES (:pk, :ck, :v)
```
Then they execute the query, but specify only
values for `:pk` and `:ck`.
We should detect that a value for :v is missing
and throw an invalid_request_exception.
Until now there was no such check, in case of a missing variable
invalid `query_options` were created and Scylla crashed.
Sadly it's impossible to create a regression test
using `cql-pytest` or `boost`.
`cql-pytest` uses the python driver, which silently
ignores mising named bind variables, deciding
that the user meant to send an UNSET_VALUE for them.
When given values like `{'pk': 1, 'ck': 2}`, it will automaticaly
extend them to `{'pk': 1, 'ck': 2, 'v': UNSET_VALUE}`.
In `boost` I tried to use `cql_test_env`,
but it only has methods which take valid `query_options`
as a parameter. I could create a separate unit tests
for the creation and validation of `query_options`
but it won't be a true end-to-end test like `cql-pytest`.
The bug was found using the rust driver,
the reproducer is available in the issue description.
Fixes: #12727
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#12730
(cherry picked from commit 2a5ed115ca)
The test test_scan.py::test_scan_long_partition_tombstone_string
checks that a full-table Scan operation ends a page in the middle of
a very long string of partition tombstones, and does NOT scan the
entire table in one page (if we did that, getting a single page could
take an unbounded amount of time).
The test is currently flaky, having failed in CI runs three times in
the past two months.
The reason for the flakiness is that we don't know exactly how long
we need to make the sequence of partition tombstones in the test before
we can be absolutely sure a single page will not read this entire sequence.
For single-partition scans we have the "query_tombstone_page_limit"
configuration parameter, which tells us exactly how long we need to
make the sequence of row tombstones. But for a full-table scan of
partition tombstones, the situation is more complicated - because the
scan is done in parallel on several vnodes in parallel and each of
them needs to read query_tombstone_page_limit before it stops.
In my experiments, using query_tombstone_limit * 4 consecutive tombstones
was always enough - I ran this test hundreds of times and it didn't fail
once. But since it did fail on Jenkins very rarely (3 times in the last
two months), maybe the multiplier 4 isn't enough. So this patch doubles
it to 8. Hopefully this would be enough for anyone (TM).
This makes this test even bigger and slower than it was. To make it
faster, I changed this test's write isolation mode from the default
always_use_lwt to forbid_rmw (not use LWT). This leaves the test's
total run time to be similar to what it was before this patch - around
0.5 seconds in dev build mode on my laptop.
Fixes#12817
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12819
(cherry picked from commit 14cdd034ee)
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated end requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-agressive eviction and
checking that it doesn't happen anymore.
Fixes: #11803Closes#13286
(cherry picked from commit bd57471e54)
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in a new stats, one for each reason a permit
can be queued.
(cherry picked from commit 7b701ac52e)
After a failed topology operation, like bootstrap / decommission /
removenode, the cluster might contain a garbage entry in either token
ring or group 0. This entry can be cleaned-up by executing removenode on
any other node, pointing to the node that failed to bootstrap or leave
the cluster.
Document this procedure, including a method of finding the host ID of a
garbage entry.
Add references in other documents.
Fixes: #13122Closes#13186
(cherry picked from commit c2a2996c2b)
This commit removes the Enterprise upgrade guides from
the Open Source documentation. The Enterprise upgrade guides
should only be available in the Enterprise documentation,
with the source files stored in scylla-enterprise.git.
In addition, this commit:
- adds the links to the Enterprise user guides in the Enterprise
documentation at https://enterprise.docs.scylladb.com/
- adds the redirections for the removed pages to avoid
breaking any links.
This commit must be reverted in scylla-enterprise.git.
(cherry picked from commit 61bc05ae49)
Closes#13473
Related: https://github.com/scylladb/scylla-enterprise/issues/2770
This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to ScyllaDB Enterprise 2023.1.
This commit does not cover metric updates (the metrics file has no
content, which needs to be added in another PR).
As this is an upgrade guide, this commit must be merged to master and
backported to branch-5.2 and branch-2023.1 in scylla-enterprise.git.
Closes#13294
(cherry picked from commit 595325c11b)
Fixes#12810
We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.
Closes#12950
* github.com:scylladb/scylladb:
commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
commitlog: change type of stored size
(cherry picked from commit e70be47276)
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.
refs: #13100
refs: #10941Closes#13141
* github.com:scylladb/scylladb:
database: Increase verbosity of database::stop() method
large_data_handler: Increase verbosity on shutdown
large_data_handler: Coroutinize .stop() method
(cherry picked from commit e22b27a107)
This reverts commit c6087cf3a0.
Said commit can cause a deadlock when 2 or more repairs compete for
locks on 2 or more nodes. Consider the following scenario:
Node n1 and n2 in the cluster, 1 shard per node, rf = 2, each shard has
1 available unit for the reader lock
n1 starts repair r1
r1-n1 (instance of r1 on node1) takes the reader lock on node1
n2 starts repair r2
r2-n2 (instance of r2 on node2) takes the reader lock on node2
r1-n2 will fail to take the reader lock on node2
r2-n1 will fail to take the reader lock on node1
As a result, r1 and r2 could not make progress and deadlock happens.
The complexity comes from the fact that a repair job needs lock on more
than one node. It is not guaranteed that all the participant nodes could
take the lock in one short.
There is no simple solution to this so we have to revert this locking
mechanism and look for another way to prevent reader trashing when
repairing nodes with mismatching shard count.
Fixes: #12693Closes#13266
(cherry picked from commit 7699904c54)
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, it even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
Fixes: #11905
Refs: #11836Closes#11860
(cherry picked from commit 84a69b6adb)
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.
Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.
Fixes#12098
(cherry picked from commit 1ef113691a)
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was timeout error.
A new test_batch_with_error is added. It is quite
difficult to reproduce error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.
Closes: scylladb#12104
(cherry picked from commit a4cf509c3d)
Fixes https://github.com/scylladb/scylladb/issues/13138
Fixes https://github.com/scylladb/scylladb/issues/13153
This PR:
- Fixes outdated information about the recommended OS. Since version 5.2, the recommended OS should be Ubuntu 22.04 because that OS is used for building the ScyllaDB image.
- Adds the OS support information for version 5.2.
This PR (both commits) needs to be backported to branch-5.2.
Closes#13188
* github.com:scylladb/scylladb:
doc: Add OS support for version 5.2
doc: Updates the recommended OS to be Ubuntu 22.04
(cherry picked from commit f4b5679804)
This PR backports 2f4a793457 to branch-5.2. Said patch depends on some other patches that are not part of any release yet.
This PR should apply to 5.1 and 5.0 too.
Closes#13162
* github.com:scylladb/scylladb:
reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
reader_permit: expose operator<<(reader_permit::state)
reader_permit: add get_state() accessor
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`
Fixes: #12249Closes#12978
* github.com:scylladb/scylladb:
flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
make_nonforwardable: test through run_mutation_source_tests
make_nonforwardable: next_partition and fast_forward_to when single_partition is true
make_forwardable: fix next_partition
flat_mutation_reader_v2: drop forward_buffer_to
nonforwardable reader: fix indentation
nonforwardable reader: refactor, extract reset_partition
nonforwardable reader: add more tests
nonforwardable reader: no partition_end after fast_forward_to()
nonforwardable reader: no partition_end after next_partition()
nonforwardable reader: no partition_end for empty reader
row_cache: pass partition_start though nonforwardable reader
(cherry picked from commit 46efdfa1a1)
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in severeal ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)
The list goes on. We already have an evict() method that all this
correctly, use that instead of the current badly open-coded alternative.
This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.
Fixes: #13048Closes#13049
(cherry picked from commit 2f4a793457)
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.
`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.
Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.
Fixes: #13055
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13056
(cherry picked from commit aa604bd935)
EOF is only guarateed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.
Fixes: #11190Closes#12696
(cherry picked from commit 693c22595a)
This commit makes the following changes to the docs landing page:
- Adds the ScyllaDB enterprise docs as one of three tiles.
- Modifies the three tiles to reflect the three flavors of ScyllaDB.
- Moves the "New to ScyllaDB? Start here!" under the page title.
- Renames "Our Products" to "Other Products" to list the products other
than ScyllaDB itself. In addtition, the boxes are enlarged from to
large-4 to look better.
The major purpose of this commit is to expose the ScyllaDB
documentation.
docs: fix the link
(cherry picked from commit 27bb8c2302)
Closes#13086
This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental.
The challenge here is that OSS and Enterprise are (still) **documented together**, but they're **not in sync** in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon).
We can take one of the following approaches:
a) Merge this PR with master and ask the 2022.2 users to refer to master.
b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts.
I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting.
Closes#12295
* github.com:scylladb/scylladb:
doc: fix the version in the comment on removing the note
doc: specify the versions where Alternator TTL is no longer experimental
(cherry picked from commit d5dee43be7)
It's known that reading large cells in reverse cause large allocations.
Source: https://github.com/scylladb/scylladb/issues/11642
The loading is preliminary work for splitting large partitions into
fragments composing a run and then be able to later read such a run
in an efficiency way using the position metadata.
The splitting is not turned on yet, anywhere. Therefore, we can
temporarily disable the loading, as a way to avoid regressions in
stable versions. Large allocations can cause stalls due to foreground
memory eviction kicking in.
The default values for position metadata say that first and last
position include all clustering rows, but they aren't used anywhere
other than by sstable_run to determine if a run is disjoint at
clustering level, but given that no splitting is done yet, it
does not really matter.
Unit tests relying on position metadata were adjusted to enable
the loading, such that they can still pass.
Fixes#11642.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12979
(cherry picked from commit d73ffe7220)
Currently all consumed range tombstone changes are unconditionally
forwarded to the validator. Even if they are shadowed by a higher level
tombstone and/or purgable. This can result in a situation where a range
tombstone change was seen by the validator but not passed to the
consumer. The validator expects the range tombstone change to be closed
by end-of-partition but the end fragment won't come as the tombstone was
dropped, resulting in a false-positive validation failure.
Fix by only passing tombstones to the validator, that are actually
passed to the consumer too.
Fixes: #12575Closes#12578
(cherry picked from commit e2c9cdb576)
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.
Fixes: #12821Fixes: #12823Fixes: #12708Closes#12824
(cherry picked from commit ef548e654d)
After 5badf20c7a applier fiber does not
stop after it gets abort error from a state machine which may trigger an
assertion because previous batch is not applied. Fix it.
Fixes#12863
(cherry picked from commit 9bdef9158e)
we should never return a reference to local variable.
so in this change, a reference to a static variable is returned
instead. this should address following warning from Clang 17:
```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
return {};
^~
```
Fixes#12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12876
(cherry picked from commit 6eab8720c4)
We currently configure only TimeoutStartSec, but probably it's not
enough to prevent coredump timeout, since TimeoutStartSec is maximum
waiting time for service startup, and there is another directive to
specify maximum service running time (RuntimeMaxSec).
To fix the problem, we should specify RunTimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).
Fixes#5430Closes#12757
(cherry picked from commit bf27fdeaa2)
Related https://github.com/scylladb/scylladb/issues/12658.
This issue fixes the bug in the upgrade guides for the released versions.
Closes#12679
* github.com:scylladb/scylladb:
doc: fix the service name in the upgrade guide for patch releases versions 2022
doc: fix the service name in the upgrade guide from 2021.1 to 2022.1
(cherry picked from commit 325246ab2a)
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change.
No conflicts when backporting.
Regression since 1d9f53c881, which is present in branch 5.1 onwards.
Closes#12851
* github.com:scylladb/scylladb:
compaction: Fix inefficiency when updating LCS backlog tracker
table: Fix quadratic behavior when inserting sstables into tracker on schema change
LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker
is calling STCS tracker's replace_sstables() with empty arguments
even when higher levels (> 0) *only* had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.
As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.
Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.
This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantial higher number of sstables,
therefore updating STCS tracker only when level 0 changes, reduces
significantly the number of times L0 backlog is recomputed.
Refs #12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12676
(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of backlog which complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.
Fixes#12499.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12593
(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: In Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins run, the
test directory names are very long, and we sometimes go over this length
limit and the result is that test.py fails creating this socket.
In this patch we simply put the socket in /tmp instead of the test
directory. We only need to do this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).
Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.
Fixes#12622Closes#12678
(cherry picked from commit 681a066923)
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.
Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager` would start, because there was free
space in the pool.
This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.
Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.
The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).
Fixes#11757Closes#12549
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
test/pylib: pool: introduce `replace_dirty`
test/pylib: pool: replace `steal` with `put(is_dirty=True)`
(cherry picked from commit 132af20057)
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.
Closes#12612
* github.com:scylladb/scylladb:
test.py: manual cluster handling for PythonSuite
test.py: stop cluster if PythonSuite fails to start
test.py: minor fix for failed PythonSuite test
(cherry picked from commit 5bc7f0732e)
If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it.
Ignore exception re-raised by the context manager.
Fixes#12360Closes#12569
* github.com:scylladb/scylladb:
test.py: handle broken clusters for Python suite
test.py: Pool discard method
(cherry picked from commit 54f174a1f4)
`ScyllaCluster.server_stop` had this piece of code:
```
server = self.running.pop(server_id)
if gracefully:
await server.stop_gracefully()
else:
await server.stop()
self.stopped[server_id] = server
```
We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).
Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.
Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.
Closes#12613
(cherry picked from commit a0ff33e777)
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.
Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.
Closes#12588
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: better logging for timeout on server startup
test/pylib: scylla_cluster: use less expensive query to check for CQL availability
(cherry picked from commit ccc2c6b5dd)
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.
Do it by wrapping every handler into a catch clause that returns
the exception message.
Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)
Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```
After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```
Closes#12563
(cherry picked from commit 2f84e820fd)
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.
Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.
Closes#12564
(cherry picked from commit 3ed3966f13)
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
def before_test(self, name) -> None:
"""Check that the cluster is ready for a test. If
there was a start error, throw it here - the server is
running when it's added to the pool, which can't be attributed
to any specific test, throwing it here would stop a specific
test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.
Prevent this by marking the cluster as dirty the first time we rethrow
the exception.
Closes#12560
(cherry picked from commit 147dd73996)
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).
However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.
Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.
Closes#12542
(cherry picked from commit 9029b8dead)
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.
Fixes: #12512Closes#12514
(cherry picked from commit 54170749b8)
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.
after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
Fixes#12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12573
(cherry picked from commit 4a0134a097)
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will loose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail as the only frozen UDTs are
allowed in primary key components.
Fixes: #12576Closes#12579
(cherry picked from commit ebc100f74f)
Fixes#12601 (maybe?)
Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.
(cherry picked from commit da8adb4d26)
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.
This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only us slows down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes#12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
before this change, we construct a sstring from a comma statement,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.
in this change
* instead of passing the size of the string_view,
both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
performance, as we don't need to perform a deep copy
the issue is reported by GCC-13:
```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
^~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12666
(cherry picked from commit 186ceea009)
Fixes#12739.
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.
`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.
The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.
The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.
Fixes#12645Closes#12646
(cherry picked from commit fa7e904cd6)
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.
People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.
Also update the docs: cluster creation and node addition procedures.
Fixes#12572.
Closes#12585
* github.com:scylladb/scylladb:
docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
conf: enable `consistent_cluster_management` by default
(cherry picked from commit 5c886e59de)
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that's doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.
Fixes#11723.
Closes#12502
* github.com:scylladb/scylladb:
test: test_topology: test for removing garbage group 0 members
test/pylib: move some utility functions to util.py
db: system_keyspace: add a virtual table with raft configuration
db: system_keyspace: improve system.raft_snapshot_config schema
service: storage_service: better error handling in `decommission`
service: storage_service: fix indentation in removenode
service: storage_service: make `removenode` work for group 0 members which are not token ring members
service/raft: raft_group0: perform read_barrier in wait_for_raft
service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
service/raft: raft_group0: link to Raft docs where appropriate
service/raft: raft_group0: more logging
service/raft: raft_group0: separate function for checking and waiting for Raft
This test is frequently failing due to a timeout when we try to restart
one of the nodes. The shutdown procedure apparently hangs when we try to
stop the `hints_manager` service, e.g.:
```
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Stopped
INFO 2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO 2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO 2023-01-13 03:22:56,997 [shard 0] hints_manager - Stopped
```
observe the 5 minute delay at the end.
There is a known issue about `hints_manager` stop hanging: #8079.
Now, for some reason, this is the only test case that is hitting this
issue. We don't completely understand why. There is one significant
difference between this test case and others: this is the only test case
which kills 2 (out of 3) servers in the cluster and then tries to
gracefully shutdown the last server. There's a hypothesis that the last
server gets stuck trying to send hints to the killed servers. We weren't
able to prove/falsify it yet. But if it's true, then this patch will:
- unblock next promotions,
- give us some important information when we see that the issue stops
appearing.
In the patch we shutdown all servers gracefully instead of killing them,
like we do in the other test cases.
Closes#12548
The docs mention that method, but it doesn't exist. Instead, the
state_machine interface defines plain .apply() one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12541
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.
Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
Improve the error handling in `decommission` in case `leave_group0`
fails, informing the user what they should do (i.e. call `removenode` to
get rid of the group 0 member), and allowing decommission to finish; it
does not make sense to let the node continue to run after it leaves the
token ring. (And I'm guessing it's also not safe. Or maybe impossible.)
Due to failures we might end up in a situation where we have a group 0
member which is not a token ring member: a decommission/removenode
which failed after leaving/removing a node from the token ring but
before leaving / removing a node from group 0.
There was no way to get rid of such a group 0 member. A node that left
the token ring must not be allowed to run further (or it can cause data
loss, data resurrection and maybe other fun stuff), so we can't run
decommission a second time (even if we tried, it would just say that
"we're not a member of the token ring" and abort). And `removenode`
would also not work, because it proceeds only if the node requested to
be removed is a member of the token ring.
We modify `removenode` so it can run in this situation and remove the
group 0 member. The parts of `removenode` related to token ring
modification are now conditioned on whether the node was a member of the
token ring. The final `remove_from_group0` step is in its own branch. Some
minor refactors were necessary. Some log messages were also modified so
it's easier to understand which messages correspond the "token movement"
part of the procedure.
The `make_nonvoter` step happens only if token ring removal happens,
otherwise we can skip directly to `remove_from_group0`.
We also move `remove_from_group0` outside the "try...catch",
fixing #11723. The "node ops" part of the procedure is related strictly
to token ring movement, so it makes sense for `remove_from_group0` to
happen outside.
Indentation is broken in this commit for easier reviewability, fixed in
the following commit.
Fixes: #11723
Right now wait_for_raft is called before performing group 0
configuration changes. We want to also call it before checking for
membership, for that it's desirable to have the most recent information,
hence call read_barrier. In the existing use cases it's not strictly
necessary, but it doesn't hurt.
removenode currently works roughly like this:
1. stream/repair data so it ends up on new replica sets (calculated
without the node we want to remove)
2. remove the node from the token ring
3. remove the node from group 0 configuration.
If the procedure fails before after step 2 but before step 3 finishes,
we're in trouble: the cluster is left with an additional voting group 0
member, which reduces group 0's availability, and there is no way to
remove this member because `removenode` no longer considers it to be
part of the cluster (it consults the token ring to decide).
Improve this failure scenario by including a new step at the beginning:
make the node a non-voter in group 0 configuration. Then, even if we
fail after removing the node from the token ring but before removing it
from group 0, we'll only be left with a non-voter which doesn't reduce
availability.
We make a similar change for `decommission`: between `unbootstrap()` (which
streams data) and `leave_ring()` (which removes our tokens from the
ring), become a non-voter. The difference here is that we don't become a
non-voter at the beginning, but only after streaming/repair. In
`removenode` it's desirable to make the node a non-voter as soon as
possible because it's already dead. In decommission it may be desirable
for us to remain a voter if we fail during streaming because we're still
alive and functional in that case.
In a later commit we'll also make it possible to retry `removenode` to
remove a node that is only a group 0 member and not a token ring member.
The test would create a scenario where one node was down while the others
started the Raft upgrade procedure. The procedure would get stuck, but
it was possible to `removenode` the downed node using one of the alive
nodes, which would unblock the Raft upgrade procedure.
This worked because:
1. the upgrade procedure starts by ensuring that all peers can be
contacted,
2. `removenode` starts by removing the node from the token ring.
After removing the node from the token ring, the upgrade procedure
becomes able to contact all peers (the peers set no longer contains the
down node). At the end, after removing the node from the token ring,
`removenode` would actually get stuck for a while, waiting for the
upgrade procedure to finish before removing the peer from group 0.
After the upgrade procedure finished, `removenode` would also finish.
(so: first the upgrade procedure waited for removenode, then removenode
waited for the upgrade procedure).
We want to modify the `removenode` procedure and include a new step
before removing the node from the token ring: making the node a
non-voter. The purpose is to improve the possible failure scenarios.
Previously, if the `removenode` procedure failed after removing the node
from the token ring but before removing it from group 0, the cluster
would contain a 'garbage' group 0 member which is a voter - reducing
group 0's availability. If the node is made a non-voter first, then this
failure will not be as big of a problem, because the leftover group 0
member will be a non-voter.
However, to correctly perform group 0 operations including making
someone a nonvoter, we must first wait for the Raft upgrade procedure to
finish (or at least wait until everyone joins group 0). Therefore by
including this 'make the node a non-voter' step at the beginning of
`removenode`, we make it impossible to remove a token ring member in the
middle of the upgrade procedure, on which the test case relied. The test
case would get stuck waiting for the `removenode` operation to finish,
which would never finish because it would wait for the upgrade procedure
to finish, which would not finish because of the dead peer.
We remove the test case; it was "lucky" to pass in the first place. We
have a dedicated mechanism for handling dead peers during Raft upgrade
procedure: the manual Raft group 0 RECOVERY procedure. There are other
test cases in this file which are using that procedure.
leave_group0 and remove_from_group0 functions both start with the
following steps:
- if Raft is disabled or in RECOVERY mode, print a simple log message
and abort
- if Raft cluster feature flag is not yet enabled, print a complex log
message and abort
- wait for Raft upgrade procedure to finish
- then perform the actual group 0 reconfiguration.
Refactor these preparation steps to a separate function,
`wait_for_raft`. This reduces code duplication; the function will also
be used in more operations later (becoming a nonvoter or turning another
server into a nonvoter).
We also change the API so that the preparation function is called from
outside by the caller before they call the reconfiguration function.
This is because in later commits, some of the call sites (mainly
`removenode`) will want to check explicitly whether Raft is enabled and
wait for Raft's availabilty, then perform a sequence of steps related
to group 0 configuration depending on the result.
Also add a private function `raft_upgrade_complete()` which we use to
assert that Raft is ready to be used.
Currently, we create `forward_aggregates` inside a function that
returns the result of a future lambda that captures these aggregates
by reference. As a result, the aggregates may be destructed before
the lambda finishes, resulting in a heap use-after-free.
To prolong the lifetime of these aggregates, we cannot use a move
capture, because the lambda is wrapped in a with_thread_if_needed()
call on these aggregates. Instead, we fix this by wrapping the
entire return statement in a do_with().
Fixes#12528Closes#12533
Currently they are upgraded during learn on a replica. The are two
problems with this. First the column mapping may not exist on a replica
if it missed this particular schema (because it was down for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second lwt
request coordinator may not have the schema for the mutation as well
(because it was freed from the registry already) and when a replica
tries to retrieve the schema from the coordinator the retrieval will fail
causing the whole request to fail with "Schema version XXXX not found"
Both of those problems can be fixed by upgrading stored mutations
during prepare on a node it is stored at. To upgrade the mutation its
column mapping is needed and it is guarantied that it will be present
at the node the mutation is stored at since it is pre-request to store
it that the corresponded schema is available. After that the mutation
is processed using latest schema that will be available on all nodes.
Fixes#10770
Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
LCS reshape is compacting all levels if a single one breaks
disjointness. That's unnecessary work because rewriting that single
level is enough to restore disjointness. If multiple levels break
disjointness, they'll each be reshaped in its own iteration, so
reducing operation time for each step and disk space requirement,
as input files can be released incrementally.
Incremental compaction is not applied to reshape yet, so we need to
avoid "major compaction", to avoid the space overhead.
But space overhead is not the only problem, the inefficiency, when
deciding what to reshape when overlapping is detected, motivated
this patch.
Fixes#12495.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12496
Some functions defined by a template in types.cc are used in other
translation units (via `cql3/untyped_result_set.hh`), but aren't
explicitly instantiated. Therefore their linking can fail, depending
on inlining decisions. (I experienced this when playing with compiler
options).
Fix that.
Closes#12539
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).
This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).
So in this patch we add a new configuration option, restrict_future_timestamp
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
sinced #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, like is allowed Cassanda (and was allowed in
Scylla prior to #5619.
The error message in the case where a futuristic timestamp is rejected
now mentions the configuration paramter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).
This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.
Fixes#12527
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12537
Cassandra refuses a request with more than one relation to the same
clustering column, for example
DELETE FROM tbl WHERE p = ? and c = ? AND c > ?
complains that
c cannot be restricted by more than one relation if it includes an Equal
But it produces different error messages for different operators and
even order.
Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept these queries, we should be
sure we do the right thing: "WHERE c = 1 AND c > 2" should match
nothing, "WHERE c = 1 AND c > 0" should match the matches of c = 1,
and so on. This patch adds a test for verify that these requests indeed
yield correct results. The test is scylla_only because, as explained
above, Cassandra doesn't support these requests at all.
Refs #12472
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12498
remove unnecessary bashism, so that this script can be interpreted
by a POSIX shell.
/bin/sh is specified in the shebang line. on debian derivatives,
/bin/sh is dash, which is POSIX compliant. but this script is
written in the bash dialect.
before this change, we could run into following build failure
when building the tree on Debian:
[7/904] ./SCYLLA-VERSION-GEN
./SCYLLA-VERSION-GEN: 37: [[: not found
after this change, the build is able to proceed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12530
The CQL binary protocol introduced "unset" values in version 4
of the protocol. Unset values can be bound to variables, which
cause certain CQL fragments to be skipped. For example, the
fragment `SET a = :var` will not change the value of `a` if `:var`
is bound to an unset value.
Unsets, however, are very limited in where they can appear. They
can only appear at the top-level of an expression, and any computation
done with them is invalid. For example, `SET list_column = [3, :var]`
is invalid if `:var` is bound to unset.
This causes the code to be littered with checks for unset, and there
are plenty of tests dedicated to catching unsets. However, a simpler
way is possible - prevent the infiltration of unsets at the point of
entry (when evaluating a bind variable expression), and introduce
guards to check for the few cases where unsets are allowed.
This is what this long patch does. It performs the following:
(general)
1. unset is removed from the possible values of cql3::raw_value and
cql3::raw_value_view.
(external->cql3)
2. query_options is fortified with a vector of booleans,
unset_bind_variable_vector, where each boolean corresponds to a bind
variable index and is true when it is unset.
3. To avoid churn, two compatiblity structs are introduced:
cql3::raw_value{,_view}_vector_with_unset, which can be constructed
from a std::vector<raw_value{,_view/}>, which is what most callers
have. They can also be constructed with explicit unset vectors, for
the few cases they are needed.
(cql3->variables)
4. query_options::get_value_at() now throws if the requested bind variable
is unset. This replaces all the throwing checks in expression evaluation
and statement execution, which are removed.
5. A new query_options::is_unset() is added for the users that can tolerate
unset; though it is not used directly.
6. A new cql3::unset_operation_guard class guards against unsets. It accepts
an expression, and can be queried whether an unset is present. Two
conditions are checked: the expression must be a singleton bind
variable, and at runtime it must be bound to an unset value.
7. The modification_statement operations are split into two, via two
new subclasses of cql3::operation. cql3::operation_no_unset_support
ignores unsets completely. cql3::operation_skip_if_unset checks if
an operand is unset (luckily all operations have at most one operand that
tolerates unset) and applies unset_operation_guard to it.
8. The various sites that accept expressions or operations are modified
to check for should_skip_operation(). This are the loops around
operations in update_statement and delete_statement, and the checks
for unset in attributes (LIMIT and PER PARTITION LIMIT)
(tests)
9. Many unset tests are removed. It's now impossible to enter an
unset value into the expression evaluation machinery (there's
just no unset value), so it's impossible to test for it.
10. Other unset tests now have to be invoked via bind variables,
since there's no way to create an unset cql3::expr::constant.
11. Many tests have their exception message match strings relaxed.
Since unsets are now checked very early, we don't know the context
where they happen. It would be possible to reintroduce it (by adding
a format string parameter to cql3::unset_operation_guard), but it
seems not to be worth the effort. Usage of unsets is rare, and it is
explicit (at least with the Python driver, an unset cannot be
introduced by ommission).
I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't
recognize unsets) with cql3::maybe_unset_value (that does), but that
caused huge amounts of churn, so I abandoned that in favor of the
current approach.
Closes#12517
Allow replacing a node given its Host ID rather than its ip address.
This series adds a replace_node_first_boot option to db/config
and makes use of it in storage_service.
The new option takes priority over the legacy replace_address* options.
When the latter are used, a deprecation warning is printed.
Documentation updated respectively.
And a cql unit_test is added.
Ref #12277Closes#12316
* github.com:scylladb/scylladb:
docs: document the new replace_node_first_boot option
dist/docker: support --replace-node-first-boot
db: config: describe replace_address* options as deprecated
test: test_topology: test replace using host_id
test: pylib: ServerInfo: add host_id
storage_service: get rid of get_replace_address
storage_service: is_replacing: rely directly on config options
storage_service: pass replacement_info to run_replace_ops
storage_service: pass replacement_info to booststrap
storage_service: join_token_ring: reuse replacement_info.address
storage_service: replacement_info: add replace address
init: do not allow cfg.replace_node_first_boot of seed node
db: config: add replace_node_first_boot option
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against other node `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.
To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
- using steady_clock is just broken, so we aren't taking anything
from users by breaking it further
- once all nodes are upgraded, it magically starts to work
Closes#12529
The PR adds an api call allowing to get the statuses of a given
task and all its descendants.
The parent-child tree is traversed in BFS order and the list of
statuses is returned to user.
Closes#12317
* github.com:scylladb/scylladb:
test: add test checking recursive task status
api: get task statuses recursively
api: change retrieve_status signature
The replace_address options are still supported
But mention in their description that they are now deprecated
and the user should use replace_node_first_boot instead.
While at it fix a typo in ignore_dead_nodes_for_replace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add test cases exercising the --replace-node-first-boot option
by replacing nodes using their host_id rather
than ip address.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Populate replacement_info.address in prepare_replacement_info
as a first step towards getting rid of get_replace_address().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For replacing a node given its (now unique) Host ID.
The existing options for replace_address*
will be deprecated in the following patches
and eventually we will stop supporting them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rename `system.raft_config` to `system.raft_snapshot_config` to make it clearer
what the table stores.
Remove the `my_server_id` partition key column from
`system.raft_snapshot_config` and a corresponding column from
`system.raft_snapshots` which would store the Raft server ID of the local node.
It's unnecessary, all servers running on a given node in different groups will
use the same ID - the Raft ID of the node which is equal to its Host ID. There
will be no multiple servers running in a single Raft group on the same node.
Closes#12513
* github.com:scylladb/scylladb:
db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Leave the guide for manual opening in though, the script might not work
in all cases.
Also update the version example, we changed how development versions
look like.
Closes#12511
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).
In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
The planned integration of cross-module optimizations in scylladb/scylladb-enterprise requires several changes to `configure.py`. To minimize the divergence between the `configure.py`s of both repositories, this series upstreams some of these changes to scylladb/scylladb.
The changes mostly remove dead code and fix some traps for the unaware.
Closes#12431
* github.com:scylladb/scylladb:
configure.py: prevent deduplication of seastar compile options
configure.py: rename clang_inline_threshold()
configure.py: rework the seastar_cflags variable
configure.py: hoist the pkg_config() call for seastar-testing.pc
configure.py: unify the libs variable for tests and non-tests
configure.py: fix indentation
configure.py: remove a stale code path for .a artifacts
Currently, we call cargo build every time we build scylla, even
when no rust files have been changed.
This is avoided by adding a depfile to the ninja rule for the rust
library.
The rust file is generated by default during cargo build,
but it uses the full paths of all depenencies that it includes,
and we use relative paths. This is fixed by specifying
CARGO_BUILD_DEP_INFO_BASEDIR='.', which makes it so the current
path is subtracted from all generated paths.
Instead of using 'always' when specifying when to run the cargo
build, a dependency on Cargo.lock is added additionally to the
depfile. As a result, the rust files are recompiled not only
when the source files included in the depfile are modified,
but also when some rust dependency is updated.
Cargo may put an old cached file as a result of the build even
when the Cargo.lock was recently updated. Because of that, the
the build result may be older than the Cargo.lock file even
if the build was just performed. This may cause ninja to rebuilt
the file every following time. To avoid this, we 'touch' the
build result, so that its last modification time is up to date.
Because the dependency on Cargo.lock was added, the new command
for the build does not modify it. Instead, the developer must
update it when modifying the dependencies - the docs are updated
to reflect that.
Closes#12489Fixes#12508
In its infinite wisdom, CMake deduplicates the options passed
to `target_compile_options`, making it impossible to pass options which require
duplication, such as -mllvm.
Passing e.g.
`-mllvm;-pgso=false;-mllvm;-inline-threshold=2500` invokes the compiler
`-mllvm -pgso=false -inline-threshold=2500`, breaking the options.
As a workaround, CMake added the `SHELL:` syntax, which makes it possible to
pass the list of options not as a CMake list, but as a shell-quoted string.
Let's use it, so we can pass multiple -mllvm options.
The name of this variable is misleading. What it really does is pass flags to
static libraries compiled by us, not just to seastar.
We will need this capability to implement cross-artifact optimizations in our
build.
We will also need to pass linker flags, and we will need to vary those flags
depending on the build mode.
This patch splits the seastar_cflags variable into per-mode lib_cflags and
lib_ldflags variables. It shouldn't change the resulting build.ninja for now,
but will be needed by later planned patches.
Put the pkg_config() for seastar-testing.pc in the same area as the call
for seastar.pc, outside of the loop.
This is a cosmetic change aimed at making following commits cleaner.
Scylla haven't had `.a` artifacts for a long time (since the Urchin days,
I believe), and the piece of code responsible for them is stale and untested.
Remove it.
In 424dbf43f ("transport: drop cql protocol versions 1 and 2"),
we dropped support for protocols 1 and 2, but some code remains
that checks for those versions. It is now dead code, so remove it.
Closes#12497
Our handling of NULLs in expressions is different from Cassandra's,
and more uniform. For example, the filter "WHERE x = NULL" is an
error in Cassandra, but supported in Scylla. Let's explain how and why.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12494
Currently, no rule enforces that the cxx.h rust header
is generated before compiling the .cc files generated
from rust. This patch adds this dependency.
Closes#12492
Make it easy to see which clusters are operated on by which tests in which build modes and so on.
Add some additional logs.
These improvements would have saved me a lot of debugging time if I had them last week and we would have https://github.com/scylladb/scylladb/pull/12482 much faster.
Closes#12483
* github.com:scylladb/scylladb:
test.py: harmonize topology logs with test.py format
test/pylib: additional logging during cluster setup
test/pylib: prefix cluster/manager logs with the current test name
test/pylib: pool: pass *args and **kwargs to the build function from get()
test.py: include mode in ScyllaClusterManager logs
Sometimes to debug some task manager module, we may want to inspect
the whole tree of descendants of some task.
To make it easier, an api call getting a list of statuses of the requested
task and all its descendants in BFS order is added.
We need millisecond resolution in the log to be able to
correlate test log with test.py log and scylla logs. Harmonize
the log format for tests which actively manage scylla servers.
The log file produced by test.py combines logs coming from multiple
concurrent test runs. Each test has its own log file as well, but this
"global" log file is useful when debugging problems with topology tests,
since many events related to managing clusters are stored there.
Make the logs easier to read by including information about the test case
that's currently performing operations such as adding new servers to
clusters and so on. This includes the mode, test run name and the name
of the test case.
We do this by using custom `Logger` objects (instead of calling
`logging.info` etc. which uses the root logger) with `LoggerAdapter`s
that include the prefixes. A bit of boilerplate 'plumbing' through
function parameters is required but it's mostly straightforward.
This doesn't apply to all events, e.g. boost test cases which don't
setup a "real" Scylla cluster. These events don't have additional
prefixes.
Example:
```
17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server...
17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10...
17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10...
17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server...
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11...
17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11...
```
The batch constructor uses an unnecessarily complicated template,
where in fact it only vector<vector<raw_value | raw_value_view>>.
Simplify the constructor to allow exactly that. Delete some confusing
comments around it.
Closes#12488
This will be used to specify a custom logger when building new clusters
before starting tests, allowing to easily pinpoint which tests are
waiting for clusters to be built and what's happening to these
particular clusters.
The logs often mention the test run and the current test case in a given
run, such as `test_topology.1` and
`test_topology.1::test_add_server_add_column`. However, if we run
test.py in multiple modes, the different modes might be running the same
test case and the logs become confusing. To disambiguate, prefix the
test run/case names with the mode name.
Example:
```
Leasing Scylla cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4
760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) for test dev/topology.test_topology.1::test_add_server_add_column
```
Currently, UDAs can't be reused if Scylla has been
restarted since they have been created. This is
caused by the missing initialization of saved
UDAs that should have inserted them to the
cql3::functions::functions::_declared map, that
should store all (user-)created functions and
aggregates.
This patch adds the missing implementation in a way
that's analogous to the method of inserting UDF to
the _declared map.
Fixes#11309
Using lambda coroutines as arguments can lead to a use-after-free.
Currently, the way these lambdas were used in do_parse_schema_tables
did not lead to such a problem, but it's better to be safe and wrap
them in coroutine::lambda(), so that they can't lead to this problem
as long as we ensure that the lambda finishes in the
do_parse_schema_tables() statement (for example using co_await).
Closes#12487
This patch adds a few simple functional test for the SELECT DISTINCT
feature, and how it interacts with other features especiall GROUP BY.
2 of the 5 new tests are marked xfail, and reproduce one old and one
newly-discovered issue:
Refs #5361: LIMIT doesn't work when using GROUP BY (the test here uses
LIMIT and GROUP BY together with SELECT DISTINCT, so the
LIMIT isn't honored).
Refs #12479: SELECT DISTINCT doesn't refuse GROUP BY with clustering
column.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12480
In test_apply_is_atomic, a basic form of exception testing is used.
There is failure_injecting_allocation_strategy, which however is not
used for any allocation, since for some reason,
`with_allocator(r.allocator()` is used instead of
`with_allocator(alloc`. Fix that.
Closes#12354
Regression introduced in 23e4c8315.
view_and_holder position_in_partiton::after_key() triggers undefined
behavior when the key was not full because the holder is moved, which invalidates the view.
Fixes#12367Closes#12447
The function underlying_type() returns an data_type by value,
but the code assigned it to a reference.
At first I was sure this is an error
(assigning temporary value to a reference), but it turns out
that this is most likely correct due to C++ lifetime
extension rules.
I think it's better to avoid such unituitive tricks.
Assigning to value makes it clearer that the code
is correct and there are no dangling references.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#12485
With sufficiently many test cases we would eventually run out of IP
addresses, because IPs (which are leased from a global host registry)
would only be released at the end of an entire test suite.
In fact we already hit this during next promotions, causing much pain
indeed.
Release IPs when a cluster, after being marked dirty, is stopped and
thrown away.
Closes#12482
Introduce a new "script" operation, which loads a script from the specified path, then feeds the mutation fragment stream to it. The script can then extract, process and present information from the sstable as it wishes.
For now only Lua scripts are supported for the simple reason that Lua is easy to write bindings for, it is simple and lightweight and more importantly we already have Lua included in the Scylla binary as it is used as the implementation language for UDF/UDA. We might consider WASM support in the future, but for now we don't have any language support in WASM available.
Example:
```lua
function new_stats(key)
return {
partition_key = key,
total = 0,
partition = 0,
static_row = 0,
clustering_row = 0,
range_tombstone_change = 0,
};
end
total_stats = new_stats(nil);
function inc_stat(stats, field)
stats[field] = stats[field] + 1;
stats.total = stats.total + 1;
total_stats[field] = total_stats[field] + 1;
total_stats.total = total_stats.total + 1;
end
function on_new_sstable(sst)
max_partition_stats = new_stats(nil);
if sst then
current_sst_filename = sst.filename;
else
current_sst_filename = nil;
end
end
function consume_partition_start(ps)
current_partition_stats = new_stats(ps.key);
inc_stat(current_partition_stats, "partition");
end
function consume_static_row(sr)
inc_stat(current_partition_stats, "static_row");
end
function consume_clustering_row(cr)
inc_stat(current_partition_stats, "clustering_row");
end
function consume_range_tombstone_change(crt)
inc_stat(current_partition_stats, "range_tombstone_change");
end
function consume_partition_end()
if current_partition_stats.total > max_partition_stats.total then
max_partition_stats = current_partition_stats;
end
end
function on_end_of_sstable()
if current_sst_filename then
print(string.format("Stats for sstable %s:", current_sst_filename));
else
print("Stats for stream:");
end
print(string.format("\t%d fragments in %d partitions - %d static rows, %d clustering rows and %d range tombstone changes",
total_stats.total,
total_stats.partition,
total_stats.static_row,
total_stats.clustering_row,
total_stats.range_tombstone_change));
print(string.format("\tPartition with max number of fragments (%d): %s - %d static rows, %d clustering rows and %d range tombstone changes",
max_partition_stats.total,
max_partition_stats.partition_key,
max_partition_stats.static_row,
max_partition_stats.clustering_row,
max_partition_stats.range_tombstone_change));
end
```
Running this script wilt yield the following:
```
$ scylla sstable script --script-file fragment-stats.lua --system-schema system_schema.columns /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-1-big-Data.db
Stats for sstable /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//me-1-big-Data.db:
397 fragments in 7 partitions - 0 static rows, 362 clustering rows and 28 range tombstone changes
Partition with max number of fragments (180): system - 0 static rows, 179 clustering rows and 0 range tombstone changes
```
Fixes: https://github.com/scylladb/scylladb/issues/9679Closes#11649
* github.com:scylladb/scylladb:
tools/scylla-sstable: consume_reader(): improve pause heuristincs
test/cql-pytest/test_tools.py: add test for scylla-sstable script
tools: add scylla-sstable-scripts directory
tools/scylla-sstable: remove custom operation
tools/scylla-sstable: add script operation
tools/sstable: introduce the Lua sstable consumer
dht/i_partitioner.hh: ring_position_ext: add weight() accessor
lang/lua: export Scylla <-> lua type conversion methods
lang/lua: use correct lib name for string lib
lang/lua: fix type in aligned_used_data (meant to be user_data)
lang/lua: use lua_State* in Scylla type <-> Lua type conversions
tools/sstable_consumer: more consistent method naming
tools/scylla-sstable: extract sstable_consumer interface into own header
tools/json_writer: add accessor to underlying writer
tools/scylla-sstable: fix indentation
tools/scylla-sstable: export mutation_fragment_json_writer declaration
tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer
tools/scylla-sstable: extract json writing logic from json_dumper
tools/scylla-sstable: extract json_writer into its own header
tools/scylla-sstable: use json_writer::DataKey() to write all keys
tools/scylla-types: fix use-after-free on main lambda captures
Inferring shard from generation is long gone. We still use it in
some scripts, but that's no longer needed in Scylla, when loading
the SSTables, and it also conflicts with ongoing work of UUID-based
generations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12476
The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes#10607Closes#12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
The consume loop had some heuristics in place to determine whether after
pausing, the consumer wishes to skip just the partition or the remaining
content of the sstable. This heuristics was flawed so replace it with a
non-heuristic method: track the last consumed fragment and look at this
to determine what should be done.
To test the script operation, we use some of the example scripts from
the example directory. Namely, dump.lua and slice.lua. These two scripts
together have a very good coverage of the entire script API. Testing
their functionality therefore also provides a good coverage of the lua
bindings. A further advantage is that since both scripts dump output in
identical format to that of the data-dump operation, it is trivial to do
a comparison against this already tested operation.
A targeted test is written for the sstable skip functionality of the
consumer API.
To be the home of example scripts for scylla-sstable. For now only a
README.md is added describing the directory's purpose and with links to
useful resources.
One example script is added in this patch, more will come later.
Loads the script from the specified path, then feeds the mutation
fragment stream to it. For now only Lua scripts are supported for the
simple reason that Lua is easy to write bindings for, it is simple and
lightweight and more importantly we already have Lua included in the
Scylla binary as it is used as the implementation language for UDF/UDA.
We might consider WASM support in the future, but for now we don't have
any language support in WASM available.
The Lua sstable consumer loads a script from the specified path then
feeds the mutation fragment stream to the script via the
sstable_consumer methods, each method of which the script is allowed to
define, effectively overloading the virtual method in Lua.
This allows for very wide and flexible customization opportunities for
what to extract from sstables and how to process and present them,
without the need to recompile the scylla-sstable tool.
Instead of the lua_slice_state which is local to this file. We want to
reuse the Scylla type <-> Lua type conversion functions but for that
they have to use the more generic lua_State*. No functionality or
convenience is lost with the switch, the code didn't make use of the
other fields bundled in lua_slice_state.
So it can be used in code outside scylla-sstable.cc. This source file is
quite large already, and as we have yet another large chunk of code to
add, we want to add it in a separate file.
There is no point in the former implementing said interface. For one it
is a futurized interface, which is not needed for something writing to
the stdout. Rename the methods to follow the naming convention of rjson
writers more closely.
We want to split this class into two parts: one with the actual logic
converting mutation fragments to json, and a wrapper over this one,
which implements the sstable_consumer interface.
As a first step we extract the class as is (no changes) and just forward
all-calls from now empty wrapper to it.
This method was renamed from its previous name of PartitionKey. Since in
json partition keys and clustering keys look alike, with the only
difference being that the former may also have a token, it makes to have
a single method to write them (with an optional token parameter). This
was the case at some point, json_dumper::write_key() taking this role.
However at a later point, json_writer::PartitionKey() was introduced and
now the code uses both. Standardize on the latter and give it a more
generic name.
The main lambda of scylla-types, the one passed to app_template::run()
was recently made a coroytine. app_template::run() however doesn't keep
this lambda alive and hence after the first suspention point, accessing
the lambda's captures triggers use-after-free.
The simple fix is to convert the coroutine into continuation chain.
Consider the following MVCC state of a partition:
v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
Where === means a continuous range and --- means a discontinuous range.
After two LRU items are evicted (entry1 and entry2), we will end up with:
v2: ---------------------- <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.
This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:
v2: ---------------------- <9> ===== <last dummy>
...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.
The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.
The situation is not easy to trigger because it requires certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.
Closes#12452
Sstable read related metrics are broken for a long time now. First, the introduction of inactive reads (https://github.com/scylladb/scylladb/issues/1865) diluted this metric, as it now also contained inactive reads (contrary to the metric's name). Then, after moving the semaphore in front of the cache (3d816b7c1) this metric became completely broken as this metric now contains all kinds of reads: disk, in-memory and inactive ones too.
This series aims to remedy this:
* `scylla_database_active_reads` is fixed to only include active reads.
* `scylla_database_active_reads_memory_consumption` is renamed to `scylla_database_reads_memory_consumption` and its description is brought up-to-date.
* `scylla_database_disk_reads` is added to track current reads that are gone to disk.
* `scylla_database_sstables_read` is added to track the number of sstables read currently.
Fixes: https://github.com/scylladb/scylladb/issues/10065Closes#12437
* github.com:scylladb/scylladb:
replica/database: add disk_reads and sstables_read metrics
sstables: wire in the reader_permit's sstable read count tracking
reader_concurrency_semaphore: add disk_reads and sstables_read stats
replica/database: fix active_reads_memory_consumption_metric
replica/database: fix active_reads metric
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.
Empty zones response should be handled regardless.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12274
Cassandra refuses a request with more than one equality relation to the
same column, for example
DELETE FROM tbl WHERE partitionKey = ? AND partitionKey = ?
It complains that
partitionkey cannot be restricted by more than one relation if it
includes an Equal
Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept this query, we should be
sure we do the right thing: "WHERE p = 1 AND p = 2" should match
nothing (not the first, or last, value being tested..), and "WHERE p = 1
AND p = 1" should match the matches of p = 1. This patch adds a test
for verify that these requests indeed yield correct results. The
test is scylla_only because, as explained above, Cassandra doesn't
support this feature at all.
Refs #12472
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12473
* seastar ca586cfb8d...8889cbc198 (14):
> http: request_parser: fix grammar ambiguity in field_content
Fixes#12468
> sstring: use fold expression to simply copy_str_to()
> sstring: use fold expression to simply str_len()
> metrics: capture by move in make_function()
> metrics: replace homebrew is_callable<> with is_invocable_v<>
> reactor: use std::move() to avoid copy.
> reactor: remove redundant semicolon.
> reactor: use mutable to make std::move() work.
> build: install liburing explicitly on ArchLinux.
> reactor: use a for loop for submitting ios
> metrics: add spaces around '='
> parallel utils: align concept with implementation
> reactor: s/resize(0)/clear()/
> reactor: fix a typo in comment
Closes#12469
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).
As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.
In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.
As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:
The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains
```
if (reverse == consume_in_reverse::yes) {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
} else {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
}
```
So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.
Refs: #12353Closes#12453
* github.com:scylladb/scylladb:
mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes
mutation: move consume_in_reverse def to mutation_consumer.hh
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore and it has an accurate view of the
actual memory consumption of reads.
Fixes: #12448Closes#12454
The test would use a trick to start a separate Scylla cluster from the
one provided originally by the test framework. This is not supported by
the test framework and may cause unexpected problems.
Change the test to perform regular node operations. Instead of starting
a fresh cluster of 3 nodes, we join the first of these nodes to the
original framework-provided cluster, then decommission the original
nodes, then bootstrap the other 2 fresh nodes.
Also add some logging to the test.
Refs: #12438, #12442Closes#12457
This series adds the implementation and usage of rust wasmtime bindings.
The WASM UDFs introduced by this patch are interruptable and use memory allocated using the seastar allocator.
This series includes #11102 (the first two commits) because #11102 required disabling wasm UDFs completely. This patch disables them in the middle of the series, and enables them again at the end.
After this patch, `libwasmtime.a` can be removed from the toolchain.
This patch also removes the workaround for #https://github.com/scylladb/scylladb/issues/9387 but it hasn't been tested with ARM yet - if the ARM test causes issues I'll revert this part of the change.
Closes#11351
* github.com:scylladb/scylladb:
build: remove references to unused c bindings of wasmtime
test: assert that WASM allocations can fail without crashing
wasm: limit memory allocated using mmap
wasm: add configuration options for instance cache and udf execution
test: check that wasmtime functions yield
wasm: use the new rust bindings of wasmtime
rust: add Wasmtime bindings
rust: add build profiles more aligned with ninja modes
rust: adjust build according to cxxbridge's recommendations
tools: toolchain: dbuild: prepare for sharing cargo cache
In issue #3668, a discussion spanning several years theorized that several
things are wrong with the "timestamp" type. This patch begins by adding
several tests that demonstrate that Scylla is in fact behaving correctly,
and mostly identically to Cassandra except one esoteric error handling
case.
However, after eliminating the red herrings, we are left for the real
issue that prompted opening #3668, which is a duplicate of issues #2693
and #2694, and this patch also adds a reproducer for that. The issue is
that Cassandra 4 added support for arithmetic expressions on values,
and timestamps can be added durations, for example:
'2011-02-03 04:05:12.345+0000' - 1d
is a valid timestamp - and we don't currently support this syntax.
So the new test - which passes on Cassandra 4 and fails on Scylla
(or Cassandra 3) is marked xfail.
Refs #2693
Refs #2694
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12436
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.
Fix this by only applying the -O1 to debug modes.
Fixes#12463Closes#12460
Before the changes intorducing the new wasmtime bindings we relied
on an downloaded static library libwasmtime.a. Now that the bindings
are introduced, we do not rely on it anymore, so all references to
it can be removed.
The main source of big allocations in the WASM UDF implementation
is the WASM Linear Memory. We do not want Scylla to crash even if
a memory allocation for the WASM Memory fails, so we assert that
an exception is thrown instead.
The wasmtime runtime does not actually fail on an allocation failure
(assuming the memory allocator does not abort and returns nullptr
instead - which our seastar allocator does). What happens then
depends on the failed allocation handling of the code that was
compiled to WASM. If the original code threw an exception or aborted,
the resulting WASM code will trap. To make sure that we can handle
the trap, we need to allow wasmtime to handle SIGILL signals, because
that what is used to carry information about WASM traps.
The new test uses a special WASM Memory allocator that fails after
n allocations, and the allocations include both memory growth
instructions in WASM, as well as growing memory manually using the
wasmtime API.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory do the WASM programs use and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually acheved by eviction of all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their cache of wasm,
the maximum size of indivitual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (for the control of
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop) and the hard limit of the size of UDFs
that are executed (to avoid large allocations)
The new implementation for WASM UDFs allows executing the UDFs
in pieces. This commit adds a test asserting that the UDF is in fact
divided and that each of the execution segments takes no longer than
1ms.
This patch replaces all dependencies on the wasmtime
C++ bindings with our new ones.
The wasmtime.hh and wasm_engine.hh files are deleted.
The libwasmtime.a library is no longer required by
configure.py. The SCYLLA_ENABLE_WASMTIME macro is
removed and wasm udfs are now compiled by default
on all architectures.
In terms of implementation, most of code using
wasmtime was moved to the Rust source files. The
remaining code uses names from the new bindings
(which are mostly unchanged). Most of wasmtime objects
are now stored as a rust::Box<>, to make it compatible
with rust lifetime requirements.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The C++ bindings provided by wasmtime are lacking a crucial
capability: asynchronous execution of the wasm functions.
This forces us to stop the execution of the function after
a short time to prevent increasing the latency. Fortunately,
this feature is implemented in the native language
of Wasmtime - Rust. Support for Rust was recently added to
scylla, so we can implement the async bindings ourselves,
which is done in this patch.
The bindings expose all the objects necessary for creating
and calling wasm functions. The majority of code implemented
in Rust is a translation of code that was previously present
in C++.
Types exported from Rust are currently required to be defined
by the same crate that contains the bridge using them, so
wasmtime types can't be exported directly. Instead, for each
class that was supposed to be exported, a wrapper type is
created, where its first member is the wasmtime class. Note
that the members are not visible from C++ anyway, the
difference only applies to Rust code.
Aside from wasmtime types and methods, two additional types
are exported with some associated methods.
- The first one is ValVec, which is a wrapper for a rust Vec
of wasmtime Vals. The underlying vector is required by
wasmtime methods for calling wasm functions. By having it
exported we avoid multiple conversions from a Val wrapper
to a wasmtime Val, as would be required if we exported a
rust Vec of Val wrappers (the rust Vec itself does not
require wrappers if the type it contains is already wrapped)
- The second one is Fut. This class represents an computation
tha may or may not be ready. We're currently using it
to control the execution of wasm functions from C++. This
class exposes one method: resume(), which returns a bool
that signals whether the computation is finished or not.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
A cargo profile is created for each of build modes: dev, debug,
sanitize, realease and coverage. The names of cargo profiles are
prefixed by "rust-" because cargo does not allow separate "dev"
and "debug" profiles.
The main difference between profiles are their optimization levels,
they correlate to the levels used in configure.py. The debug info
is stripped only in the dev mode, and only this mode uses
"incremental" compilation to speed it up.
Currently, the rust build system in Scylla creates a separate
static library for each incuded rust package. This could cause
duplicate symbol issues when linking against multiple libraries
compiled from rust.
This issue is fixed in this patch by creating a single static library
to link against, which combines all rust packages implemented in
Scylla.
The Cargo.lock for the combined build is now tracked, so that all
users of the same scylla version also use the same versions of
imported rust modules.
Additionally, the rust package implementation and usage
docs are modified to be compatible with the build changes.
This patch also adds a new header file 'rust/cxx.hh' that contains
definitions of additional rust types available in c++.
Rust's cargo caches downloaded sources in ~/.cargo. However dbuild
won't provide access to this directory since it's outside the source
directory.
Prepare for sharing the cargo cache between the host and the dbuild
environment by:
- Creating the cache if it doesn't already exist. This is likely if
the user only builds in a dbuild environment.
- Propagating the cache directory as a mounted volume.
- Respecting the CARGO_HOME override.
It's been a long while since we built ScyllaDB for s390x, and in
fact the last time I checked it was broken on the ragel parser
generator generating bad source files for the HTTP parser. So just
drop it from the list.
I kept s390x in the architecture mapping table since it's still valid.
Closes#12455
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).
As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.
In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.
As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:
The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains
```
if (reverse == consume_in_reverse::yes) {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
} else {
while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
co_await yield();
}
}
```
So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.
Aborting of repair operation is fully managed by task manager.
Repair tasks are aborted:
- on shutdown; top level repair tasks subscribe to global abort source. On shutdown all tasks are aborted recursively
- through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to node_ops_info abort source
- with task manager api (top level tasks are abortable)
- with storage_service api and on failure; these cases were modified to be aborted the same way as the ones from above are.
Closes#12085
* github.com:scylladb/scylladb:
repair: make top level repair tasks abortable
repair: unify a way of aborting repair operations
repair: delete sharded abort source from node_ops_info
repair: delete unused node_ops_info from data_sync_repair_task_impl
repair: delete redundant abort subscription from shard_repair_task_impl
repair: add abort subscription to data sync task
tasks: abort tasks on system shutdown
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.
Fixes#12429Closes#12430
* github.com:scylladb/scylladb:
storage_service: restore_replica_count: demote status_checker related logging to debug level
storage_service: restore_replica_count: allow aborting removenode_with_repair
storage_service: coroutinize restore_replica_count
storage_service: restore_replica_count: undefer stop_status_checker
storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification
storage_service: restore_replica_count: coroutinize status_checker
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
raft: replace experimental raft option with dedicated flag
main: move supervisor notification about group registry start where it actually starts
Sometimes we may need task status to be nothrow move constructible.
httpd::task_manager_json::task_status does not satisfy this requirement.
retrieve_status returns future<full_task_status> instead of future<task_status>
to provide an intermediate struct with better properties. An argument
is passed by reference to prevent the necessity to copy foreign_ptr.
Fixes https://github.com/scylladb/scylladb/issues/12314
This PR adds the upgrade guide for ScyllaDB Enterprise - from version
2022.1 to 2022.2. Using this opportunity, I've replaced "Scylla" with
"ScyllaDB" in the upgrade-enterprise index file.
In previous releases, we added several upgrade guides - one per platform
(and version). In this PR, I've merged the information for different
platforms to create one generic upgrade guide. It is similar to what
@kbr- added for the Open Source upgrade guide from 5.0 to 5.1. See
https://docs.scylladb.com/stable/upgrade/upgrade-opensource/upgrade-guide-from-5.0-to-5.1/.
Closes#12339
* github.com:scylladb/scylladb:
docs: add the info about minor release
docs: add the new upgade guide 2022.1 to 2022.2 to the index and the toctree
docs: add the index file for the new upgrage guide from 2022.1 to 2022.2
docs: add the metrics update file to the upgrade guide 2022.1 to 2022.2
docs: add the upgrade guide for ScyllaDB Enterprise from 2022.1 to 2022.2
the status_checker is not the main line of business
of restore_replica_count, starting and stopping it
do nt seem to deserve info level logging, which
might have been useful in the past to debug issues
surrounding that.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.
Fixes#12429
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that all exceptions in the rest of the function
are swallowed, just execute the stop_status_checker
deferred action serially before returning, on the
wau to coroutinizing restore_replica_count (since
we can't co_await status_checker inside the deferred
action).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
On the way to coroutinizing restore_replica_count,
extract awaiting stream_async and send_replication_notification
into a try/catch blocks so we can later undefer stop_status_checker.
The exception is still returned as an exceptional future
which is logged by the caller as warning.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to start a thread for the status_checker
and can be implemented using a background coroutine.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Timouts are benign, especially on a read-ahead that turned out to be not
needed at all. They just introduce noise in the logs, so silence them.
Fixes: #12435Closes#12441
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method. It allows to override the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.
Fixes: #12252Closes#12374
* github.com:scylladb/scylladb:
raft: raft_group0, register RPC verbs on all shards
raft: raft_append_entries, copy entries to the target shard
test.py, allow to specify the node's command line in test
Due to lack of NDEBUG macro inlining was disabled. It's
important for parsing and printing performance.
Testing with perf_simple_query shows that it reduced around
7000 insns/op, thus increasing median tps by 4.2% for the alternator frontend.
Because inlined functions are called for every character
in json this scales with request/response size. When
default write size is increased by around 7x (from ~180 to ~ 1255
bytes) then the median tps increased by 12%.
Running:
./build/release/test/perf/perf_simple_query_g --smp 1 \
--alternator forbid --default-log-level error \
--random-seed=1235000092 --duration=60 --write
Results before the patch:
median 46011.50 tps (197.1 allocs/op, 12.1 tasks/op, 170989 insns/op, 0 errors)
median absolute deviation: 296.05
maximum: 46548.07
minimum: 42955.49
Results after the patch:
median 47974.79 tps (197.1 allocs/op, 12.1 tasks/op, 163723 insns/op, 0 errors)
median absolute deviation: 303.06
maximum: 48517.53
minimum: 44083.74
The change affects both json parsing and printing.
Closes#12440
Run tests for parallelized aggregation with
`enable_parallelized_aggregation` set always to true, so the tests work
even if the default value of the option is false.
Closes#12409
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).
A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.
Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.
One test checking the 16-bit serialization format is removed.
Version 3 was introduced in 2014 (Cassandra 2.1) and was supported
in the very first version of Scylla (2a7da21481 "CQL binary protocol").
Cassandra 3.0 (2015) dropped protocols 1 and 2 as well.
It's safe enough to drop it now, 9 years after introduction of v3
and 7 years after Cassandra stopped supporting it.
Dropping it allows dropping cql_serialization_format, which causes
quite a lot of pain, and is probably broken. This will be dropped in the
following patch.
* seastar 3db15b5681...ca586cfb8d (28):
> reactor: trim returned buffer to received number of bytes
> util/process: include used header
> build: drop unused target_include_directories()
> build: use BUILD_IN_SOURCE instead chdir <SOURCE_DIR>
> build: specify CMake policy CMP0135 to new
> tests: only destroy allocated pending connections
> build: silence the output when generating private keys
> tests, httpd: Limit loopback connection factory sharding
> lw_shared_ptr: Add nullptr_t comparing operators
> noncopyable_function: Add concept for (Func func) constructor
> reactor: add process::terminate() and process::kill()
> Merge 'tests, include: include headers without ".." in path' from Kefu Chai
> build: customize toolset for building Boost
> build: use different toolset base on specified compiler
> allocator: add an option to reserve additional memory for the OS
> Merge 'build: pass cflags and ldflags to cooking.sh' from Kefu Chai
> build: build static library of cryptopp
> gate: add gate holders debugging
> build: detect debug build of yaml-cpp also
> build: do not use pkg_search_module(IMPORTED_TARGET) for finding yaml-cpp
> build: bump yaml-cpp to 0.7.0 in cooking_recipe
> build: bump cryptopp to 8.7.0 in cooking_recipe
> build: bump boost to 1.81.0 in cooking_recipe
> build: bump fmtlib to 9.1.0 in cooking_recipe
> shared_ptr: add overloads for fmt::ptr()
> chunked_fifo: const_iterator: use the base class ctor
> build: s/URING_LIBARIES/URING_LIBRARIES/
> build: export the full path of uring with URING_LIBRARIES
Closes#12434
test_ssl
In very slow debug builds the default driver timeouts are too low and
tests might fail. Bump up the values to a more reasonable time.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#12408
The alternator compatibility.md document mentions the missing ACL
(access control) feature, but unlike other missing features we
forgot to link to the open issue about this missing feature.
So let's add that link.
Refs #5047.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12399
Fixes https://github.com/scylladb/scylladb/issues/12318
This PR removes all occurrences of the `auto_bootstrap` option in the docs.
In most cases, I've simply removed the option name and its definition, but sometimes additional changes were necessary:
- In node-joined-without-any-data.rst, I removed the `auto_bootstrap `option as one of the causes of the problem.
- In rebuild-node.rst, I removed the first step in the procedure (enabling the `auto_bootstrap `option).
- In admin. rst, I removed the section about manual bootstrapping - it's based on setting `auto_bootstrap` to false, which is not possible now.
Closes#12419
* github.com:scylladb/scylladb:
docs: remove the auto_bootstrap option from the admin procedures - involves removing the Manual Bootstraping section
docs: remove the auto_bootstrap option from the procedure to replace a dead node
docs: remove the auto_bootstrap option from the Troubleshooting article about a node joining with no data
docs: remove the auto_bootstrap option from the procedure to rebuild a node after losing the data volume
docs: remove the auto_bootstrap option from the procedures to create a cluster or add a DC
And the infrastructure to reader_permit to update them. The
infrastructure is not wired in yet.
These metrics will be used to count the number of reads gone to disk and
the number of sstables read currently respectively.
Rename to reads_memory_consumption and drop the "active" from the
description as well. This metric tracks the memory consumption of all
reads: active or inactive. We don't even currently have a way to track
the memory consumption of only active reads.
Drop the part of the description which explains the interaction with
other metrics: this part is outdated and the new interactions are much
more complicated, no way to explain in a metric description.
Also ask the semaphore to calculate the memory amount, instead of doing
it in the metric itself.
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added
to identify the problem.
Fixes: #12252
This metric has been broken for a long time, since inactive reads were
introduced. As calculated currently, it includes all permits that passed
admission, including inactive reads. On the other hand, it excludes
permits created bypassing admission.
Fix by using the newly introduced (in this patch)
reader_concurrency_semaphore::active_reads() as the basis of this
metric: this now includes all permits (reads) that are currently active,
excluding waiters and inactive reads.
If append_entries RPC was received on a non-zero shard, we may
need to pass it to a zero (or, potentially, some other) shard.
The problem is that raft::append_request contains entries in the form
of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't
support cross-shard reference counting. In debug mode it contains
a special ref-counting facility debug_shared_ptr_counter_type,
which resorts to on_internal_error if it detects such a case.
To solve this, we just copy log entries to the target shard if it
isn't equal to the current one. In most cases, if --smp setting
is the same on all nodes, RPC will be handled on zero shard,
so there will be no overhead.
An optional parameter cmdline has been added to
the ManagerClient.server_add method.
It allows you to override the default parameters
set by the SCYLLA_CMDLINE_OPTIONS variable
by changing, adding or deleting individual
items. To change or add a parameter just specify
its name and value one after the other.
To remove parameter use the special keyword
__remove__ as a value. To set a parameter
without a value (such as --overprovisioned)
use the special keyword __missing__ as the value.
Add to test/cql-pytest/README.md an explanation of the philosophy
of the cql-pytest test suite, and some guideliness on how to write
good tests in that framework.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12400
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
The PR introduces changes to task manager api:
- extends tasks' list returned with get_tasks with task type,
keyspace, table, entity, and sequence number
- extends status returned with get_task_status and wait_task
with a list of children's ids
Closes#12338
* github.com:scylladb/scylladb:
api: extend status in task manager api
api: extend get_tasks in task manager api
Fixes https://github.com/scylladb/scylladb/issues/11999.
This PR adds a description of scylla-api-cli.
Closes#12392
* github.com:scylladb/scylladb:
docs: fix the description of the system log POST example
docs: uptate the curl tool name
docs: describe how to use the scylla-api-client tool
docs: fix the scylla-api-client tool name
docs: document scylla-api-cli
When closing _lower_bound and *_upper_bound
in the final close() call, they are currently left with
an engaged current_list member.
If the index_reader uses a _local_index_cache,
it is evicted with evict_gently which will, rightfully,
see the respective pages as referenced, and they won't be
evicted gently (only later when the index_reader is destroyed).
Reset index_bound.current_list on close(index_bound&)
to free up the reference.
Ref #12271
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12370
Since [repair: Always use run_replace_ops](2ec1f719de), nodes no longer publish HIBERNATE state so we don't need to support handling it.
Replace is now always done using node operations (using repair or streaming).
so nodes are never expected to change status to HIBERNATE.
Therefore storage_service:handle_state_replacing is not needed anymore.
This series gets rid of it and updates documentation related to STATUS:HIBERNATE respectively.
Fixes#12330Closes#12349
* github.com:scylladb/scylladb:
docs: replace-dead-node: get rid of hibernate status
storage_service: get rid of handle_state_replacing
We already check is remote's node topology is missing before creating a
connection, but local node topology can be missing too when we will use
raft to manage it. Raft needs to be able to create connections before
topology is knows.
Message-Id: <20221228144944.3299711-7-gleb@scylladb.com>
database_test in timing out because it's having to run the tests calling
do_with_cql_env_and_compaction_groups 3x, one for each compaction group
setting. reduce it to 2 settings instead of 3 if running in debug mode.
Refs #12396.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12421
This lets us carry fewer things and rely on the distribution
for maintenance.
The frozen toolchain is updated. Incidental updates include clang 15.0.6,
and pytest that doesn't need workarounds.
Closes#12397
This reverts commit ac2e2f8883. It causes
a regression ("std::bad_variant_access in load_view_build_progress").
Commit 2978052113 (a reindent) is also reverted as part of
the process.
Fixes#12395
In commit acfa180766 we added to
test/cql-pytest a mechanism to detect when Scylla crashes in the middle
of a test function - in which case we report the culprit test and exit
immediately to avoid having a hundred more tests report that they failed
as well just because Scylla was down.
However, if Scylla was *never* up - e.g., if the user ran "pytest" without
ever running Scylla - we still report hundreds of tests as having failed,
which is confusing and not helpful.
So with this patch, if a connection cannot be made to Scylla at all,
the test exits immediately, explaining what went wrong, not blaming
any specific test:
$ pytest
...
! _pytest.outcomes.Exit: Cannot connect to Scylla at --host=localhost --port=9042 !
============================ no tests ran in 0.55s =============================
Beyond being a helpful reminder for a developer who runs "pytest" without
having started Scylla first (or using test/cql-pytest/run or test.py to
start Scylla easily), this patch is also important when running tests
through test.py if it reuses an instance of Scylla that crashed during an
earlier pytest file's run.
This patch does not fix test.py - it can still try to run pytest with
a dead Scylla server without checking. But at least with this patch
pytest will notice this problem immediately and won't report hundreds of
test functions having failed. The only report the user will see will be
the last test which crashed Scylla, which will make it easier to find
this failure without being hidden between hundreds of spurious failures.
Fixes#12360
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12401
This patch enables offstrategy compaction for all classic streaming
based node ops. We can use this method because tables are streamed one
after another. As long as there is still streamed data for a given
table, we update the automatic trigger timer. When all the streaming has
finished, the trigger timer will timeout and fire the offstrategy
compaction for the given table.
I checked with this patch, rebuild is 3X faster. There was no compaction
in the middle of the streaming. The streamed sstables are compacted
together after streaming is done.
Time Before:
INFO 2022-11-25 10:06:08,213 [shard 0] range_streamer - Rebuild
succeeded, took 67 seconds, nr_ranges_remaining=0
Time After:
INFO 2022-11-25 09:42:50,943 [shard 0] range_streamer - Rebuild
succeeded, took 23 seconds, nr_ranges_remaining=0
Compaciton Before:
88 sstables were written -> 88 sstables were added into main set
Compaction After:
88 sstables written -> after offstretegy 2 sstables were added into main seet
Closes#11848
invoke_on_task is used in translation units where its definition is not
visible, yet it has no explicit instantiations. If the compiler always
decides to inline the definition, not to instantiate it implicitly,
linking invoke_on_task will fail. (It happened to me when I turned up
inline-threshold). Fix that.
Closes#12387
In very slow debug builds the default driver timeouts are too low and
tests might fail. Bump up the values to more reasonable time.
These timeout values are the same as used in topology tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#12405
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous could cause the new page
and subsequently the entire query to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.
This commit also adds a unit test which reproduces the bug.
Fixes: #12361Closes#12384
On some docker instance configuration, hostname resolution does not
work, so our script will fail on startup because we use hostname -i to
construct cqlshrc.
To prevent the error, we can use --rpc-address or --listen-address
for the address since it should be same.
Fixes#12011Closes#12115
In case a table is dropped, we should ignore it in the repair_updater,
since we can not update off strategy trigger for a dropped table.
Refs #12373Closes#12388
Every 1 hour, compaction manager will submit all registered table_state
for a regular compaction attempt, all without yielding.
This can potentially cause a reactor stall if there are 1000s of table
states, as compaction strategy heuristics will run on behalf of each,
and processing all buckets and picking the best one is not cheap.
This problem can be magnified with compaction groups, as each group
is represented by a table state.
This might appear in dashboard as periodic stalls, every 1h, misleading
the investigator into believing that the problem is caused by a
chronological job.
This is fixed by piggybacking on compaction reevaluation loop which
can yield between each submission attempt if needed.
Fixes#12390.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12391
They are currently missing from the printout
when the a table is created, but they are determinal
to understanding the mode with which tombstones are to
be garbage-collected in the table. gcGraceSeconds alone
is no longer enough since the introduction of
tombstone_gc_option in a8ad385ecd.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12381
In issue #10767, concerned were raised that the CLUSTERING ORDER BY
clause is handled incorrectly in a CREATE MATERIALIZED VIEW definition.
The tests in this patch try to explore the different ways in which
CLUSTERING ORDER BY can be used in CREATE MATERIALIZED VIEW and allows
us to compare Scylla's behaivor to Cassandra, and to common sense.
The tests discover that the CLUSTERING ORDER BY feature in materialized
views generally works as expected, but there are *three* differences
between Scylla and Cassandra in this feature. We consider two differences
to be bugs (and hence the test is marked xfail) and one a Scylla extension:
1. When a base table has a reverse-order clustering column and this
clustering column is used in the materialized view, in Cassandra
the view's clustering order inherits the reversed order. In Scylla,
the view's clustering order reverts to the default order.
Arguably, both behaviors can be justified, but usually when in doubt
we should implement Cassandra's behavior - not pick a different
behavior, even if the different behavior is also reasonable. So
this test (test_mv_inherit_clustering_order()) is marked "xfail",
and a new issue was created about this difference: #12308.
If we want to fix this behavior to match Cassandra's we should also
consider backward compatibility - what happens if we change this
behavior in Scylla now, after we had the opposite behavior in
previous releases? We may choose to enshrine Scylla's Cassandra-
incompatible behavior here - and document this difference.
2. The CLUSTERING ORDER BY should, as its name suggests, only list
clustering columns. In Scylla, specifying other things, like regular
columns, partition-key columns, or non-existent columns, is silently
ignored, whereas it should result in an Invalid Request error (as it
does in Cassandra). So test_mv_override_clustering_order_error()
is marked "xfail".
This is the difference already discovered in #10767.
3. When a materialized view has several clustering columns, Cassandra
requires that a CLUSTERING ORDER BY clause, if present, must specify
the order of all of *all* clustering columns. Scylla, in contrast,
allows the user to override the order of only *some* of these columns -
and the rest get the default order. I consider this to be a
legitimate Scylla extension, and not a compatibility bug, so marked
the test with "scylla_only", and no issue was opened about it.
Refs #10767
Refs #12308
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12307
This patch adds a scylla_inject_error(), a context manager which tests
can use to temporarily enable some error injection while some test
code is running. It can be used to write tests that artificially
inject certain errors instead of trying to reach the elaborate (and
often requiring precise timing or high amounts of data) situation where
they occur naturally.
The error-injection API is Scylla-specific (it uses the Scylla REST API)
and does not work on "release"-mode builds (all other modes are supported),
so when Cassandra or release-mode build are being tested, the test which
uses scylla_inject_error() gets skipped.
Example usage:
```python
from rest_api import scylla_inject_error
with scylla_inject_error(cql, "injection_name", one_shot=True):
# do something here
...
```
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12264
Retrieves the configuration item with the given name and prints its
value as well as its metadata.
Example:
(gdb) scylla get-config-value compaction_static_shares
value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart
Closes#12362
* github.com:scylladb/scylladb:
scylla-gdb.py: add scylla get-config-value gdb command
scylla-gdb.py: extract $downcast_vptr logic to standalone method
test: scylla-gdb/run: improve diagnostics for failed tests
These options have been nonsense since 2017.
--pie and --so are ignored, --static disables (sic!) static linking of
libraries.
Remove them.
Closes#12366
Retrieves the configuration item with the given name and prints its
value as well as its metadata.
Example:
(gdb) scylla get-config-value compaction_static_shares
value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.
Fixes#12271Closes#12364
The consume_in_reverse::legacy_half_reverse format is soon to be phased
out. This commit starts treating frozen_mutations from replicas for
reversed queries so that they are consumed with consume_in_reverse::yes.
Currently the scylla tools (`scylla-types` and `scylla-sstable`) have documentation in two places: high level documentation can be found at `docs/operating-scylla/admin-tools/scylla-{types,sstable}.rst`, while low level, more detailed documentation is embedded in the tool itself. This is especially pronounced for `scylla-sstable`, which only has a short description of its operations online, all details being found only in the command-line help.
We want to move away from this model, such that all documentation can be found online, with the command-line help being reserved to documenting how the various switches and flags work, on top of a short description of the operation and a link to the detailed online docs.
Closes#12284
* github.com:scylladb/scylladb:
tool/scylla-sstable: move documentation online
docs: scylla-sstable.rst: add sstable content section
docs: scylla-{sstable,types}.rst: drop Syntax section
Allows static configuration of number of compaction groups per table per shard.
To bootstrap the project, config option x_log2_compaction_groups was added which controls both number of groups and partitioning within a shard.
With a value of 0 (default), it means 1 compaction group, therefore all tokens go there.
With a value of 3, it means 8 compaction groups, and 3 most-significant-bits of tokens being used to decide which group owns the token.
And so on.
It's still missing:
- integration with repair / streaming
- integration with reshard / reshape.
perf/perf_simple_query --smp 1 --memory 1G
BEFORE
-----
median 61358.55 tps ( 71.1 allocs/op, 12.2 tasks/op, 56375 insns/op, 0 errors)
median 61322.80 tps ( 71.1 allocs/op, 12.2 tasks/op, 56391 insns/op, 0 errors)
median 61058.58 tps ( 71.1 allocs/op, 12.2 tasks/op, 56386 insns/op, 0 errors)
median 61040.94 tps ( 71.1 allocs/op, 12.2 tasks/op, 56381 insns/op, 0 errors)
median 61118.40 tps ( 71.1 allocs/op, 12.2 tasks/op, 56379 insns/op, 0 errors)
AFTER
-----
median 61656.12 tps ( 71.1 allocs/op, 12.2 tasks/op, 56486 insns/op, 0 errors)
median 61483.29 tps ( 71.1 allocs/op, 12.2 tasks/op, 56495 insns/op, 0 errors)
median 61638.05 tps ( 71.1 allocs/op, 12.2 tasks/op, 56494 insns/op, 0 errors)
median 61726.09 tps ( 71.1 allocs/op, 12.2 tasks/op, 56509 insns/op, 0 errors)
median 61537.55 tps ( 71.1 allocs/op, 12.2 tasks/op, 56491 insns/op, 0 errors)
Closes#12139
* github.com:scylladb/scylladb:
test: mutation_test: Test multiple compaction groups
test: database_test: Test multiple compaction groups
test: database_test: Adapt it to compaction groups
db: Add config for setting static number of compaction groups
replica: Introduce static compaction groups
test: sstable_test: Stop referencing single compaction group
api: compaction_manager: Stop a compaction type for all groups
api: Estimate pending tasks on all compaction groups
api: storage_service: Run maintenance compactions on all compaction groups
replica: table: Adapt assertion to compaction groups
replica: database: stop and disable compaction on behalf of all groups
replica: Introduce table::parallel_foreach_table_state()
replica: disable auto compaction on behalf of all groups
replica: table: Rework compaction triggers for compaction groups
replica: Adapt table::get_sstables_including_compacted_undeleted() to compaction groups
replica: Adapt table::rebuild_statistics() to compaction groups
replica: table: Perform major compaction on behalf of all groups
replica: table: Perform off-strategy compaction on behalf of all groups
replica: table: Perform cleanup compaction on behalf of all groups
replica: Extend table::discard_sstables() to operate on all compaction groups
replica: table: Create compound sstable set for all groups
replica: table: Set compaction strategy on behalf of all groups
replica: table: Return min memtable timestamp across all groups
replica: Adapt table::stop() to compaction groups
replica: Adapt table::clear() to compaction groups
replica: Adapt table::can_flush() to compaction groups
replica: Adapt table::flush() to compaction groups
replica: Introduce parallel_foreach_compaction_group()
replica: Adapt table::set_schema() to compaction groups
replica: Add memtables from all compaction groups for reads
replica: Add memtable_count() method to compaction_group
replica: table: Reserve reader list capacity through a callback
replica: Extract addition of memtables to reader list into a new function
replica: Adapt table::occupancy() to compaction groups
replica: Adapt table::active_memtable() to compaction groups
replica: Introduce table::compaction_groups()
replica: Preparation for multiple compaction groups
scylla-gdb: Fix backward compatibility of scylla_memtables command
Said mechanism broke tools and tests to some extent: the read it executes on sstable load time means that if the sstable is broken enough to fail this read, it will fail to load, preventing diagnostic tools to load it and examine it and preventing tests from producing broken sstables for testing purposes.
Closes#12359
* github.com:scylladb/scylladb:
sstables: allow bypassing first/last position metadata loading
sstables: sstable::{load,open_data}(): fix indentation
sstables: coroutinize sstable::open_data()
sstables: sstable::open_data(): use clear_gently() to clear token ranges
sstables: coroutinize sstable::load()
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations would be easily
distinguished from the ids of other entities.
Closes#11673
* seastar 3a5db04197...3db15b5681 (27):
> build: get the full path of c-ares
> build: unbreak pkgconfig output
> http: Add 206 Partial Content response code
> http: Carry integer content_length on reply
> tls_test: drop duplicated includes
> tls_test: remove duplicated test case
> reactor: define __NR_pidfd_open if not defined
> sockets: Wait on socket peer closing the connection
> tcp: Close connection when getting RST from server
> Merge 'Enhance rpc tester with delays, timeouts and verbosity' from Pavel Emelyanov
> Merge 'build: use pkg_search_module(.. IMPORTED_TARGET ..) ' from Kefu Chai
> build: define GnuTLS_{LIBRARIES, INCLUDE_DIRS} only if GnuTLS is found
> build: use pkg_search_module(.. IMPORTED_TARGET ..)
> addr2line: extend asan regex
> abort_source: move-assign operator: call base class unlink
> coroutine: correct syntax error in doxygen comment
> demo: Extend http connection demo with https
> test: temporarily disable warning for tests triggering warnings
> tests/unit/coroutine: Include <ranges>
> sstring: Document why sstring exists at all
> test: log error when read/write to pipe fails
> test: use executables in /bin
> tests: spawn_test: use BOOST_CHECK_EQUAL() for checking equality of temporary_buffer
> docker: bump up to clang {14,15} and gcc {11,12}
> shared_ptr: ignore false alarm from GCC-12
> build: check for fix of CWG2631
> circleci: use versioned container image
Closes#12355
In the past we had issue #7933 where very long strings of consecutive
tombstones caused Alternator's paging to take an unbounded amount of
time and/or memory for a single page. This issue was fixed (by commit
e9cbc9ee85) but the two tests we had
reproducing that issue were left with the "xfail" mark.
They were also marked "veryslow" - each taking about 100 seconds - so
they didn't run by default so nobody noticed they started to pass.
In this patch I make these tests much faster (taking less than a second
together), confirm that they pass - and remove the "xfail" mark and
improve their descriptions.
The trick to making these tests faster is to not create a million
tombstones like we used to: We now know that after string of just 10,000
tombstones ('query_tombstone_page_limit') the page should end, so
we can check specifically this number. The story is more complicated for
partition tombstones, but there too it should be a multiple of
query_tombstone_page_limit. To make the tests even faster, we change
run.py to lower the query_tombstone_page_limit from the default 10,000
to 1000. The tests work correctly even without this change, but they are
ten times faster with it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12350
* Update Nixpkgs base
* Clarify some comments
* Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as
cxx-rs)
* Add missing libraries (libdeflate, libxcrypt)
* Fix expected hash of the gdb patch
* Fix a couple of small build problems
Fixes#12259Closes#12346
* github.com:scylladb/scylladb:
build: fix Nix devenv
cql3: mark several private fields as maybe_unused
configure.py: link with more abseil libs
* Update Nixpkgs base
* Clarify some comments
* Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as
cxx-rs)
* Add missing libraries (libdeflate, libxcrypt)
* Fix expected hash of the gdb patch
* Bump Python driver to 3.25.20-scylla
Fixes#12259
Because they are indeed unused -- they are initialized, passed down
through some layers, but not actually used. No idea why only Clang 12
in debug mode in Nix devenv complains about it, though.
Specifically libabsl_strings{,_internal}.a.
This fixes failure to link tests in the Nix devenv; since presumably
all is good in other setups, it must be something weird having to do
with inlining?
The extra linked libraries shouldn't hurt in any case.
Extends mutation_test to run the tests with more than one
compaction group, in addition to a single one (default).
Piggyback on existing tests. Avoids duplication.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Extends database_test to run the tests with more than one
compaction group, in addition to a single one (default).
Piggyback on existing tests. Avoids duplication.
Caught a bug when snapshotting, in implementation of
table::can_flush(), showing its usefulness.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
data_sync_repair_task_impl subscribes to corresponding node_ops_info
abort source and then, when requested, all its descedants are
aborted recursively. Thus, shard_repair_task_impl does not need
to subscribe to the node_ops_info abort source, since the parent
task will take care of aborting once it is requested.
abort_subscription and connected attributes are deleted from
the shard_repair_task_impl.
When node operation is aborted, same should happen with
the corresponding task manager's repair task.
Subscribe data_sync_repair_task_impl abort() to node_ops_info
abort_source.
This new option allows user to control the number of compaction groups
per table per shard. It's 0 by default which implies a single compaction
group, as is today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is the initial support for multiple groups.
_x_log2_compaction_groups controls the number of compaction groups
and the partitioning strategy within a single table.
The value in _x_log2_compaction_groups refers to log base 2 of the
actual number of groups.
0 means 1 compaction group.
1 means 2 groups and 2 most significant bits of token being
used to pick the target group.
The group partitioner should be later abstracted for making tablet
integration easier in the future.
_x_log2_compaction_groups is still a constant but a config option
will come next.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Estimates # of compaction jobs to be performed on a table.
Adaptation is done by adding estimation from all groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With compaction group model, truncate_table_on_all_shards() needs
to stop and disable compaction for all groups.
replica::table::as_table_state() will be removed once no user
remains, as each table may map to multiple groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will replace table::as_table_state(). The latter will be
killed once its usage drops to zero.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Allow table-wide compaction trigger, as well as fine-grained trigger
like after flushing a memtable on behalf of a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
discard_sstables() runs on context of truncate, which is a table-wide
operation today, and will remain so with multiple static groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
clear() clears memtable content and cache.
Cache is shared by groups, therefore adaptation happens by only
clearing memtables of all groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
can_flush() is used externally to determine if a table has an active
memtable that can be flushed. Therefore, adaptation happens by
returning true if any of the groups can be flushed. A subsequent
flush request will flush memtable of all groups that are ready
for it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Adaptation of flush() happens by trigger flush on memtable of all
groups.
table::seal_active_memtable() will bail out if memtable is empty, so
it's not a problem to call flush on a group which memtable is empty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This variant will be useful when iterating through groups
and performing async actions on each. It guarantees that all
groups are alive by the time they're reached in the loop.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
set_schema() is used by the database to apply schema changes to
table components which include memtables.
Adaptation happens by setting schema to memtable(s) of all groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's add memtables of all compaction groups. Point queries are
optimized by picking a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
add_memtables_to_reader_list() will be adapted to compaction groups.
For point queries, it will add memtables of a single group.
With the callback, add_memtables_to_reader_list() can tell its
caller the exact amount of memtable readers to be added, so it
can reserve precisely the readers capacity.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
table::occupancy() provides accumulated occupancy stats from
memtables.
Adaptation happens by accumulating stats from memtables of
all groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
active_memtable() was fine to a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Useful for iterating through all groups. This is intermediary
implementation which requires allocation as only one group
is supported today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
ea99750de7 ("test: give tests less-unique identifiers") made
the disambiguating ids only be unambiguous within a single test
case. This made all tests named "run" have the name name "run.1".
Fix that by adding the suite name everywhere: in test paths, and
in junit test case names.
Fixes#12310.
Closes#12313
`ranges_to_stream` is a map of ` std::unordered_multimap<dht::token_range, inet_address>` per keyspace.
On large clusters with a large number of keyspace, copying it may cause reactor stalls as seen in #12332
This series eliminates this copy by using std::move and also
turns `stream_ranges` into a coroutine, adding maybe_yield calls to avoid further stalls down the road.
Fixes#12332Closes#12343
* github.com:scylladb/scylladb:
storage_service: stream_ranges: unshare streamer
storage_service: stream_ranges: maybe_yield
storage_service: coroutinize stream_ranges
storage_service: unbootstrap: move ranges_to_stream_by_keyspace to stream_ranges
With replace using node operations, the HIBERNATE
gossip status is not used anymore.
This change updates documentation to reflect that.
During replace, the replacing nodes shows in gossipinfo
in STATUS:NORMAL.
Also, the replaced node shows as DN in `nodetool status`
while being replaced, so remove paragraph showing it's
not listed in `nodetool status`.
Plus. tidy up the text alignment.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since 2ec1f719de nodes no longer
publish HIBERNATE state so we don't need to support handling it.
Replace is now always done using node operations (using
repair or streaming).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that stream_ranges is a coroutine
streamer can be an automatic variable on the
coroutine stack frame.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Avoid a potentially large memory copy causing
a reactor stall with a large number of keyspaces.
Fixes#12332
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is to define the API sstable needs from underlying storage. When implementing object-storage backend it will need to implement those. The API looks like
future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const;
future<> quarantine(const sstable& sst, delayed_commit_changes* delay);
future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay);
void open(sstable& sst, const io_priority_class& pc); // runs in async context
future<> wipe(const sstable& sst) noexcept;
future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity);
It doesn't have "list" or alike, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as separate PR.
Closes#12217
* github.com:scylladb/scylladb:
sstable, storage: Mark dir/temp_dir private
sstable: Remove get_dir() (well, almost)
sstable: Add quarantine() method to storage
sstable: Use absolute/relative path marking for snapshot()
sstable: Remove temp_... stuff from sstable
sstable: Move open_component() on storage
sstable: Mark rename_new_sstable_component_file() const
sstable: Print filename(type) on open-component error
sstable: Reorganize new_sstable_component_file()
sstable: Mark filename() private
sstable: Introduce index_filename()
tests: Disclosure private filename() calls
sstable: Move wipe_storage() on storage
sstable: Remove temp dir in wipe_storage()
sstable: Move unlink parts into wipe_storage
sstable: Remove get_temp_dir()
sstable: Move write_toc() to storage
sstable: Shuffle open_sstable()
sstable: Move touch_temp_dir() to storage
sstable: Move move() to storage
sstable: Move create_links() to storage
sstable: Move seal_sstable() to storage
sstable: Tossing internals of seal_sstable()
sstable: Move remove_temp_dir() to storage
sstable: Move create_links_common() to storage
sstable: Move check_create_links_replay() to storage
sstable: Remove one of create_links() overloads
sstable: Remove create_links_and_mark_for_removal()
sstable: Indentation fix after prevuous patch
sstable: Coroutinize create_links_common()
sstable: Rename create_links_common()'s "dir" argument
sstable: Make mark_for_removal bool_class
sstable, table: Add sstable::snapshot() and use in table::take_snapshot
sstable: Move _dir and _temp_dir on filesystem_storage
sstable: Use sync_directory() method
test, sstable: Use component_basename in test
sstables: Move read_{digest|checksum} on sstable
Fix a bug in failure handling and log level.
Closes#12336
* github.com:scylladb/scylladb:
test.py: convert param to str
test.py: fix error level for CQL tests
Type of operation is related to a specific implementation
of a task. Then, it should rather be access with a virtual
method in tasks::task_manager::task::impl than be
its attribute.
Closes#12326
* github.com:scylladb/scylladb:
api: delete unused type parameter from task_manager_test api
tasks: repair: api: remove type attribute from task_manager::task::status
tasks: add type() method to task_manager::task::impl
repair: add reason attribute to repair_task
After commit a57724e711, off-strategy no longer races with view
building, therefore deletion code can be simplified and piggyback
on mechanism for deleting all sstables atomically, meaning a crash
midway won't result in some of the files coming back to life,
which leads to unnecessary work on restart.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12245
The format_unidiff() function takes str, not pathlib PosixPath, so
convert it to str.
This prevented diff output of unexpected result to be shown in the log
file.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Recently, the pytest script shipped by Fedora started invoking python
with the `-s` flag, which disables python considering user site
packages. This caused problems for our tests which install the cassandra
driver in the user site packages. This was worked around in e5e7780f32
by providing our own pytest interposer launcher script which does not
pass the above mentioned flag to python. Said patch fixed test.py but
not the run.py in cql-pytest. So if the cql-pytest suite is ran via
test.py it works fine, but if it is invoked via the run script, it fails
because it cannot find the cassandra driver. This patch patches run.py
to use our own pytest launcher script, so the suite can be run via the
run script as well.
Since run.py is shared with the alternator pytest suite, this patch also
fixes said test suite too.
Closes#12253
The inline-help of operations will only contain a short summary of the
operation and the link to the online documentation.
The move is not a straightforward copy-paste. First and foremost because
we move from simple markdown to RST. Informal references are also
replaced with proper RST links. Some small edits were also done on the
texts.
The intent is the following:
* the inline help serves as a quick reference for what the operation
does and what flags it has;
* the online documentation serves as the full reference manual,
explaining all details;
Provides a link to the architecture/sstable page for more details on the
sstable format itself. It also describes the mutation-fragment stream,
the parts of it that is relevant to the sstable operations.
The purpose of this section is to provide a target for links that want to
point to a common explanation on the topic. In particular, we will soon
move the detailed documentation of the scylla-sstable operations into
this file and we want to have a common explanation of the mutation
fragment stream that these operations can point to.
In both files, the section hierarchy is as follows:
Usage
Syntax
Sections with actual content
This scheme uses up 3 levels of hierarchy, leaving not much room to
expand the sections with actual content with subsections of their own.
Remove the Syntax level altogether, directly embedding the sections with
content under the Usage section.
This PR fixes several bugs related to handling of non-full
clustering keys.
One is in trim_clustering_row_ranges_to(), which is broken for non-full keys in reverse
mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes#12180
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys as after_key() is used
in various parts in the read path.
Refs #1446Closes#12234
* github.com:scylladb/scylladb:
position_in_partition: Make after_key() work with non-full keys
position_in_partition: Introduce before_key(position_in_partition_view)
db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
types: Fix comparison of frozen sets with empty values
Now all storage access via sstable happens with the help of storage
class API so its internals can be finally made private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::get_dir() is now gone, no callers know that sstable lives
in any path on a filesystem. There are only few callers left.
One is several places in code that need sstable datafile, toc and index
paths to print them in logs. The other one is sstable_directory that is
to be patched separately.
For both there's a storage.prefix() method that prepends component name
with where the sstable is "really" located.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Moving sstable to quarantine has some specific -- if the sstable is in
staging/ directory it's anyway moved into root/quarantine dir, not into
the quarantine subdir of its current location.
Encapsulate this feture in storage class method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snapshotting code uses full paths to files to manipulate snapshotted
sstables. Until this code is patched to use some proper snapshotting API
from sstable/ module, it will continue doing so.
Nowever, to remove the get_dir() method from sstable() the
seal_sstable() needs to put relative "backup" directory to
storage::snapshot() method. This patch adds a temporary bool_class for
this distinguishing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of helpers around XFS-specific temp-dir sitting in
publie sstable part. Drop it altogether, no code needs it for real.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The file path is going to disappear soon, so print the filename() on
error. For now it's the same, but the meaning of the filename()
returning string is changing to become "random label for the log
reader".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper consists of three stages:
1. open a file (probably in a temp dir)
2. decorate it with extentions and checked_file
3. optionally rename a file from temp dir
The latter is done to trigger XFS allocate this file in separate block
group if the file was created in temp dir on step 1.
This patch swaps steps 2 and 3 to keep filesystem-specific opening next
to each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the sstable::filename(Index) is used in several places that
get the filename as a printable or throwable string and don't treat is
as a real location of any file.
For those, add the index_filename() helper symmetrical to toc_filename()
and (in some sense) the get_filename() one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is going to become private method. Lots of tests
call it, but tests do call a lot of other sstable private methods,
that's OK. Make the sstable::filename() yet another one of that kind in
advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the filesystem cleaning code is sitting in one method, it can
finally be made the storage class one.
Exception-safe allocation of toc_name (spoiler: it's copied anyway one
step later, so it's "not that safe" actually) is moved into storage as
well. The caller is left with toc_filename() call in its exception
handler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When unlinking an sstable for whatever reason it's good to check if the
temp dir is handing around. In some cases it's not (compaction), but
keeping the whole wiping code together makes it easier to move it on
storage class in one go.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method initiates the sstable creation. Effectively it's the first
step in sstable creation transaction implemented on top of rename()
call. Thus this method is moved onto storage under respective name.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an sstable is prepared to be written on disk the .write_toc() is
called on it which created temporary toc file. Prior to this, the writer
code calls generate_toc() to collect components on the sstable.
This patch adds the .open_sstable() API call that does both. This
prepares the write_toc() part to be moved to storage, because it's not
just "write data into TOC file", it's the first step in transaction
implemeted on top of rename()s.
The test need care -- there's rewrite_toc_without_scylla_component()
thing in utils that doesn't want the generate_toc() part to be called.
It's not patched here and continues calling .write_toc().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable can be "moved" in two cases -- to move from staging or to
move to quarantine. Both operation are sstable API ones, but the
implementation is storage-specific. This patch makes the latter a method
of storage class.
One thing to note is that only quarantine() touched the target directly.
Now also the move_to_new_dir() happenning on load also does it, but
that's harmless.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is currently used in two places: sstable::snapshot() and
sstable::seal_sstable(). The latter additionally touches the target
backup/ subdir.
This patch moves the whole thing on storage and adds touch for all the
cases. For snapshots this might be excessive, but harmless.
Tests get their private-disclosure way to access sstable._storage in
few places to call create_links directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the sstable sealing is split into storage part, internal-state part
and the seal-with-backup kick.
This move makes remove_temp_dir() private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them -- one API call and the other one that just
"seals" it. The latter one also changes the _marked_for_deletion bit on
the sstable.
This patch makes the latter method prepared to be moved onto storage,
because sealing means comitting TOC file on disk with the help of rename
system call which is purely storage thing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as previous patch. This move makes the previously moved
check_create_links_replay() a private method of the storage class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It needs to get sstable const reference to get the filename(s) from it.
Other than that it's pure filesystem-accessing method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two -- one that accepts generation and the other one that does
not. The latter is only called by the former, so no need in keeping both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's only one user of it, it can document its "and mark for removal"
intention via dedicated bool_class argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looks much shorter and easier-to-patch this way.
The dst_dir argument is made value from const reference, old code copied
it with do_with() anyway.
Indentation is deliberately left broken until next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The whole method is going to move onto newly introduced
filesystem_storage that already has field of the same name onboard. To
avoid confusion, rename the argument to dst_dir.
No functional changes, _just_ s/dir/dst_dir/g throughout the method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Its meaning is comment-documented anyway. Also, next patches will remove
the create_links_and_mark_for_removal() so callers need some verbose
meaning of this boolean in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The replica/ code now "knows" that snapshotting an sstable means
creating a bunch of hard-links on disk. Abstract that via
sstable::snapshot() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those two fields define the way sstable is stored as collection of
on-disk files. First step towards making the storage access abstract is
in moving the paths onto filesystem_storage embedded class.
Both are made public for now, the rest of the code is patched to access
them via _storage.<smth>. The rest of the set moves parts of sstable::
methods into the filesystem_storage, then marks the paths private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One case gets full sstable datafile path to get the basename from it.
There's already the basename helper on the class sstable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These methods access sstables as files on disk, in order to hide the
"path on filesystem" meaning of sstables::filename() the whole method
should be made sstable:: one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now, with a44ca06906, is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12294
* github.com:scylladb/scylladb:
topology: get rid of pending state
topology: debug log update and remove endpoint
Refactor the existing stats tracking and updating
code into struct latency_stats_tracker and while at it,
count lock_acquisitions only on success.
Decrement operations_currently_waiting_for_lock in the destructor
so it's always balanced with the uncoditional increment
in the ctor.
As for updating estimated_waiting_for_lock, it is always
updated in the dtor, both on success and failure since
the wait for the lock happened, whether waiting
timed out or not.
Fixes#12190
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12225
The schedule_repair() receives a bunch of endpoint:mutations pairs and tries to create handlers for those. When creating the handlers it re-obtains topology from schema->ks->effective_replication_map chain, but this new topology can be outdated as compared to the list of endpoints at hand.
The fix is to carry the e.r.m. pointer used by read executor reconciliation all the way down to repair handlers creation. This requires some manipulations with mutate_internal() and mutate_prepare() argument lists.
fixes: #12050 (it was the same problem)
Closes#12256
* github.com:scylladb/scylladb:
proxy: Carry replication map with repair mutation(s)
proxy: Wrap read repair entries into read_repair_mutation
proxy: Turn ref to forwardable ref in mutations iterator
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys.
A complicated function (in continuation style) that benefits
from this simplification.
Closes#12289
* github.com:scylladb/scylladb:
sstables: update_info_for_opened_data: reindent
sstables: update_info_for_opened_data: coroutinize
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batch of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafed large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
Fixes#12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12305
The abseil and tools/java submodules were accidentally updated in
71bc12eecc
(merged to master in 51f867339e)
This series reverts those changes.
Closes#12311
* github.com:scylladb/scylladb:
Revert accidental update of tools/java submodule
Revert accidental update of abseil submodule
The create_write_response_handler() for read repair needs the e.r.m.
from the caller, because it effectively accepts list of endpoints from
it.
So this patch equips all read_repair_mutation-s with the e.r.m. pointer
so that the handler creation can use it. It's the same for all
mutations, so it's a waste of space, but it's not bad -- there's
typically few mutations in this range and the entry passed there is
temporary, so even lots of them won't occupy lots of memory for long.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schedule_repair() operates on a map of endpoint:mutations pairs.
Next patch will need to extend this entry and it's going to be easier if
the entry is wrapped in a helper structure in advance.
This is where the forwardable reference cursor from the previous patch
gets its user. The schedule_repair() produces a range of rvalue
wrappers, but the create_write_response_handler accepting it is OK, it
copies mutations anyway.
The printing operator is added to facilitate mutations logging from
mutate_internal() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The mutate_prepare() is iterating over range of mutation with 'auto&'
cursor thus accepting only lvalues. This is very restrictive, the caller
of mutate_prepare() may as well provide rvalues if the target
create_write_response_handler() or lambda accepts it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR implements two things:
* Getting the value of a conjunction of elements separated by `AND` using `expr::evaluate`
* Preparing conjunctions using `prepare_expression`
---
`NULL` is treated as an "unkown value" - maybe `true` maybe `false`.
`TRUE AND NULL` evaluates to `NULL` because it might be `true` but also might be `false`.
`FALSE AND NULL` evaluates to `FALSE` because no matter what value `NULL` acts as, the result will still be `FALSE`.
Unset and empty values are not allowed.
Usually in CQL the rule is that when `NULL` occurs in an operation the whole expression becomes `NULL`, but here we decided to deviate from this behavior.
Treating `NULL` as an "unkown value" is the standard SQL way of handing `NULLs` in conjunctions.
It works this way in MySQL and Postgres so we do it this way as well.
The evaluation short-circuits. Once `FALSE` is encountered the function returns `FALSE` immediately without evaluating any further elements.
It works this way in Postgres as well, for example:
`SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error,
but `SELECT false AND 1/0 = 0` will successfully evaluate to `FALSE`.
Closes#12300
* github.com:scylladb/scylladb:
expr_test: add unit tests for prepare_expression(conjunction)
cql3: expr: make it possible to prepare conjunctions
expr_test: add tests for evaluate(conjunction)
cql3: expr: make it possible to evaluate conjunctions
Fixes https://github.com/scylladb/scylladb/issues/11712
Updates added with this PR:
- Added a new section with the description of AzureSnitch (similar to others + examples and language improvements).
- Fixed the headings so that they render properly.
- Replaced "Scylla" with "ScyllaDB".
Closes#12254
* github.com:scylladb/scylladb:
docs: replace Scylla with ScyllaDB on the Snitches page
docs: fix the headings on the Snitches page
doc: add the description of AzureSnitch to the documentation
prepare_expression used to throw an error
when encountering a conjunction.
Now it's possible to use prepare_expression
to prepare an expression that contains
conjunctions.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Previously it was impossible to use expr::evaluate()
to get the value of a conjunction of elements
separated by ANDs.
Now it has been implemented.
NULL is treated as an "unkown value" - maybe true maybe false.
`TRUE AND NULL` evaluates to NULL because it might be true but also might be false.
`FALSE AND NULL` evaluates to FALSE because no matter what value NULL acts as, the result will still be FALSE.
Unset and empty values are not allowed.
Usually in CQL the rule is that when NULL occurs in an operation the whole expression
becomes NULL, but here we decided to deviate from this behavior.
Treating NULL as an "unkown value" is the standard SQL way of handing NULLs in conjunctions.
It works this way in MySQL and Postgres so we do it this way as well.
The evaluation short-circuits. Once FALSE is encountered the function returns FALSE
immediately without evaluating any further elements.
It works this way in Postgres as well, for example:
`SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error
but `SELECT false AND 1/0 = 0` will successfully evaluate to FALSE.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The infinetely high time_point of `db_clock::time_point::max()`
used in ba42852b0e
is too high for some clients that can't represent
that as a date_time string.
Instead, limit it to 9999-12-31T00:00:00+0000,
that is practically sufficient to ensure truncation of
all sstables and should be within the clients' limits.
Fixes#12239
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12273
Add instructions on how to backport a feature to on older version of Scylla.
It contains a detailed step-by-step instruction so that people unfamiliar with intricacies of Scylla's repository organization can easily get the hang of it.
This is the guide I wish I had when I had to do my first backport.
I put it in backport.md because that looks like the file responsible for this sort of information.
For a moment I thought about `CONTRIBUTING.md`, but this is a really short file with general information, so it doesn't really fit there. Maybe in the future there will be some sort of unification (see #12126)
Closes#12138
* github.com:scylladb/scylladb:
dev/docs: add additional git pull to backport docs
docs/dev: add a note about cherry-picking individual commits
docs/dev: use 'is merged into' instead of 'becomes'
docs/dev: mention that new backport instructions are for the contributor
docs/dev: Add backport instructions for contributors
Several snitch drivers make http requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successfull and read response body to parse the
data they need from.
That's not nice, add checks for requests finish with http OK statuses.
refs: #12185
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12287
Now, with a44ca06906,
is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep
all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
update_normal_tokens checks that that the endpoint is in topology. Currently we call update_topology on this path only if it's not a normal_token_owner, but there are paths when the endpoint could be a normal token owner but still
be pending in topology so always update it, just in case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12080
* github.com:scylladb/scylladb:
storage_service: handle_state_normal: always update_topology before update_normal_tokens
storage_service: handle_state_normal: delete outdated comment regarding update pending ranges race
Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade).
Fixes: https://github.com/scylladb/scylladb/issues/12204Closes#12275
* github.com:scylladb/scylladb:
service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0`
gms, service: stop gossiping and storing RAFT_SERVER_ID
Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round"
service: use HOST_ID instead of RAFT_SERVER_ID during replace
service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map
main: use Host ID as Raft ID
This series improves the add-node-to-cluster document, in particular around the documentation for the associated cleanup procedure, and the prerequisite steps.
It also removes information about outdated releases.
Closes#12210
* github.com:scylladb/scylladb:
docs: operating-scylla: add-node-to-cluster: deleted instructions for unsupported releases
docs: operating-scylla: add-node-to-cluster: cleanup: move tips to a note
docs: operating-scylla: add-node-to-cluster: improve wording of cleanup instructions
docs: operating-scylla: prerequisites: system_auth is a keyspace, not a table
docs: operating-scylla: prerequisites: no Authetication status is gathered
docs: operating-scylla: prerequisites: simplify grep commands
docs: operating-scylla: add-node-to-cluster: prerequisites: number sub-sections
docs: operating-scylla: add-node-to-cluster: describe other nodes in plural
Revert version change made by PR #11106, which increased it to `4.0.0`
to enable server-side describe on latest cqlsh.
Turns out that our tooling some way depends on it (eg. `sstableloader`)
and it breaks dtests.
Reverting only the version allows to leave the describe code unchanged
and it fixes the dtests.
cqlsh 6.0.0 will return a warning when running `DESC ...` commands.
Closes#12272
We no longer need to translate from IP to Raft ID using the address map,
because Raft ID is now equal to the Host ID - which is always available
at the call site of `remove_from_group0`.
It is equal to (if present) HOST_ID and no longer used for anything.
The application state was only gossiped if `experimental-features`
contained `raft`, so we can free this slot.
Similarly, `raft_server_id`s were only persisted in `system.peers` if
the `SUPPORTS_RAFT` cluster feature was enabled, which happened only
when `experimental-features` contained `raft`. The `raft_server_id`
field in the schema was also introduced recently in `master` and didn't
get to be in a release yet. Given either of these reasons, we can remove
this field safely.
Fixes /scylladb/scylla-enterprise/issues#1262
Changes the somewhat ambiguous "none" into "not set" to clarify that "none" is not an
option to be written out, but an absense of a choice (in which case you also have made
a choice).
Closes#12270
The Host ID now uniquely identifies a node (we no longer steal it during
node replace) and Raft is still experimental. We can reuse the Host ID
of a node as its Raft ID. This will allow us to remove and simplify a
lot of code.
With this we can already remove some dead code in this commit.
Script for "one-click" opening of coredumps.
It extracts the build-id from the coredump, retrieves metadata for that
build, downloads the binary package, the source code and finally
launches the dbuild container, with everything ready to load the
coredump.
The script is idempotent: running it after the prepartory steps will
re-use what is already donwloaded.
The script is not trying to provide a debugging environment that caters
to all the different ways and preferences of debugging. Instead, it just
sets up a minimalistic environment for debugging, while providing
opportunities for the user to customization according to their
preferred.
I'm not entirely sure, coredumps from master branch will work, but we
can address this later when we confirm they don't.
Example:
$ ~/ScyllaDB/scylla/worktree0/scripts/open-coredump.sh ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000
Build id: 5009658b834aaf68970135bfc84f964b66ea4dee
Matching build is scylla-5.0.5 0.20221009.5a97a1060 release-x86_64
Downloading relocatable package from http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
Extracting package scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
Cloning scylla.git
Downloading scylla-gdb.py
Copying scylla-gdb.py from /home/bdenes/ScyllaDB/storage/11961/open-coredump.sh.dir/scylla.repo
Launching dbuild container.
To examine the coredump with gdb:
$ gdb -x scylla-gdb.py -ex 'set directories /src/scylla' --core ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000 /opt/scylladb/libexec/scylla
See https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md for more information on how to debug scylla.
Good luck!
[root@fedora workdir]#
Closes#12223
We want to always be able to distinguish between
the replacing node and the replacee by using different,
unique, host identifiers.
This will allow us to use the host_id authoritatively
to identify the node (rather then its endpoint ip address)
for token mapping and node operations.
Also, it will be used in the following patch to never allow the
replaced node to rejoin the cluster, as its host_id should never
be reused.
This change does not affect #5523, the replaced node may still steal back its tokens if restarted.
Refs #9839
Refs #12040Closes#12250
* github.com:scylladb/scylladb:
docs: replace-dead-node: update host_id of replacing node
docs: replace-dead-node: fix alignment
db: system_keyspace: change set_local_host_id to private set_local_random_host_id
storage_service: do not inherit the host_id of a replaced a node
The generic task holds and destroyes a task::impl
but we want the derived class's destructor to be called
when the task is destroyed otherwise, for example,
member like abort_source subscription will not be destroyed
(and auto-unlinked).
Fixes#12183
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12266
It is moved into the async thread so the encapsulating
function should be defined mutable to move the func
rather thna copying it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12267
This patch includes a translation of two more test files from
Cassandra's CQL unit test directory cql3/validation/operations.
All tests included here pass on Cassandra. Several test fail on Scylla
and are marked "xfail". These failures discovered two previously-unknown
bugs:
#12243: Setting USING TTL of "null" should be allowed
#12247: Better error reporting for oversized keys during INSERT
And also added reproducers for two previously-known bugs:
#3882: Support "ALTER TABLE DROP COMPACT STORAGE"
#6447: TTL unexpected behavior when setting to 0 on a table with
default_time_to_live
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12248
We recently (commit 6a5d9ff261) started
to use std::source_location instead of std::experimental::source_location.
However, this does not work on clang 14, because libc++ 12's
<source_location> only works if __builtin_source_location, and that is
not available on clang 14.
clang 15 is just three months old, and several relatively-recent
distributions still carry clang 14 so it would be nice to support it
as well.
So this patch adds a trivial compatibility header file, which, when
included and compiled with clang 14, it aliases the functional
std::experimental::source_location to std::source_location.
It turns out it's enough to include the new header file from three
headers that included <source_location> - I guess all other uses
of source_location depend on those header files directly or indirectly.
We may later need to include the compatibility header file in additional
places, bug for now we don't.
Refs #12259
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12265
This PR adds server-side `DESCRIBE` statement, which is required in latest cqlsh version.
The only change from the user perspective is the `DESC ...` statement can be used with cqlsh version >= 6.0. Previously the statement was executed from client side, but starting with Cassandra 4.0 and cqlsh 6.0, execution of describe was moved to server side, so the user was unable to do `DESC ...` with Scylla and cqlsh 6.0.
Implemented describe statements:
- `DESC CLUSTER`
- `DESC [FULL] SCHEMA`
- `DESC [ONLY] KEYSPACE`
- `DESC KEYSPACES/TYPES/FUNCTIONS/AGGREGATES/TABLES`
- `DESC TYPE/FUNCTION/AGGREGATE/MATERIALIZED VIEW/INDEX/TABLE`
- `DESC`
[Cassandra's implementation for reference](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/DescribeStatement.java)
Changes in this patch:
- cql3::util: added `single_quite()` function
- added `data_dictionary::keyspace_element` interface
- implemented `data_dictionary::keyspace_element` for:
- keyspace_metadata,
- UDT, UDF, UDA
- schema
- cql3::functions: added `get_user_functions()` and `get_user_aggregates()` to get all UDFs/UDAs in specified keyspace
- data_dictionary::user_types_metadata: added `has_type()` function
- extracted `describe_ring()` from storage_service to standalone helper function in `locator/util.hh`
- storage_proxy: added `describe_ring()` (implemented using helper function mentioned above)
- extended CQL grammar to handle describe statement
- increased version in `version.hh` to 4.0.0, so cqlsh will use server-side describe statement
Referring: https://github.com/scylladb/scylla/issues/9571, https://github.com/scylladb/scylladb/issues/11475Closes#11106
* github.com:scylladb/scylladb:
version: Increasing version
cql-pytest: Add tests for server-side describe statement
cql-pytest: creating random elements for describe's tests
cql3: Extend CQL grammar with server-side describe statement
cql3:statements: server-side describe statement
data_dictonary: add `get_all_keyspaces()` and `get_user_keyspaces()`
storage_proxy: add `describe_ring()` method
storage_service, locator: extract describe_ring()
data_dictionary:user_types_metadata: add has_type() function
cql3:functions: `get_user_functions()` and `get_user_aggregates()`
implement `keyspace_element` interface
data_dictionary: add `keyspace_element` interface
cql3: single_quote() util function
view: row_lock: lock_ck: reindent
test/topology: enable replace tests
service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
service: handle replace correctly with Raft enabled
gms/gossiper: fetch RAFT_SERVER_ID during shadow round
service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
The `current()` version in version.hh has to be increased to at
least 4.0.0, so server-side describe will be used. Otherwise,
cqlsh returns warning that client-side describe is not supported.
Add helper functions to create random elements (keyspaces, tables, types)
to increase the coverage of describe statment's tests.
This commit also adds `random_seed` fixture. The fixture should be
always used when using random functions. In case of test's failure, the
seed will be present in test's signature and the case can be easili
recreated.
After the test finishes, the fixture restores state of `random` to
before-test state.
Starting from cqlsh 6.0.0, execution of the describe statement was moved
from the client to the server.
This patch implements server-side describe statement. It's done by
simply fetching all needed keyspace elements (keyspace/table/index/view/UDT/UDF/UDA)
and generating the desired description or list of names of all elements.
The description of any element has to respect CQL restrictions(like
name's quoting) to allow quickly recreate the schema by simply copy-pasting the descritpion.
In order to execute `DESC CLUSTER`, there has to be a way to describe
ring. `storage_service` is not available at query execution. This patch
adds `describe_ring()` as a method of `storage_proxy()` (using helper
function from `locator/util.hh`).
`describe_ring()` was implemented as a method of `storage_service`. This
patch extracts it from there to a standalone helper function in
`locator/util.hh`.
A common interace for all keyspace elements, which are:
keyspace, UDT, UDF, UDA, tables, views, indexes.
The interface is to have a unified way to describe those elements.
`single_quote()` takes a string and transforms it to a string
which can be safely used in CQL commands.
Single quoting involves wrapping the name in single-quotes ('). A sigle-quote
character itself is quoted by doubling it.
Single quoting is necessary for dates, IP addresses or string literals.
Also simplify the code and improve logging in general.
The previous code did this: search for the ID in the address map. If it
couldn't be found, perform a read barrier and search again. If it again
couldn't be found, return.
This algorithm depended on the fact that IP addresses were stored in
group 0 configuration. The read barrier was used to obtain the most
recent configuration, and if the IP was not a part of address map after
the read barrier, that meant it's simply not a member of group 0.
This logic no longer applies so we can simplify the code.
Furthermore, when I was fixing the replace operation with Raft enabled,
at some point I had a "working" solution with all tests passing. But I
was suspicious and checked if the replaced node got removed from
group 0. It wasn't. So the replace finished "successfully", but we had
an additional (voting!) member of group 0 which didn't correspond to
a token ring member.
The last version of my fixes ensure that the node gets removed by the
replacing node. But the system is fragile and nothing prevents us from
breaking this again. At least log an error for now. Regression tests
will be added later.
We must place the Raft ID obtained during the shadow round in the
address map. It won't be placed by the regular gossiping route if we're
replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it
is not guaranteed that background gossiping manages to update the
address map before we need it, especially in tests where we set
ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round,
on the other hand, performs a synchronous request (and if it fails
during bootstrap, bootstrap will fail - because we also won't be able to
obtain the tokens and Host ID of the replaced node).
Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
During the replace operation we need the Raft ID of the replaced node.
The shadow round is used for fetching all necessary information before
the replace operation starts.
Most of the sleeps related to gossiping are based on `ring_delay`,
which is configurable and can be set to lower value e.g. during tests.
But for some reason there was one case where we slept for a hardcoded
value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds.
Use `2 * get_ring_delay()` instead. With the default value of
`ring_delay` (30 seconds) this will give the same behavior.
When we have a table with partition key p and an indexed regular column
v, the test included in this patch checks the query
SELECT p FROM table WHERE v = 1 AND TOKEN(p) > 17
This can work and not require ALLOW FILTERING, because the secondary index
posting-list of "v=1" is ordered in p's token order (to allow SELECT with
and without an index to return the same order - this is explained in
issue #7443). So this test should pass, and indeed it does on both current
Scylla, and Cassandra.
However, it turns out that this was a bug - issue #7043 - in older
versions of Scylla, and only fixed in Scylla 4.6. In older versions,
the SELECT wasn't accepted, claiming it requires ALLOW FILTERING,
and if ALLOW FILTERING was added, the TOKEN(p) > 17 part was silently
ignored.
The fix for issue #7043 actually included regression tests, C++ tests in
test/boost/secondary_index_test.cc. But in this patch we also add a Python
test in test/cql-pytest.
One of the benefits of cql-pytest is that we can (and I did) run the same
test on Cassandra to verify we're not implementing a wrong feature.
Another benefit is that we can run a new test on an old version, and
not even require re-compilation: You can run this new test on any
existing installation of Scylla to check if it still has issue #7043.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12237
The replacing node no longer assumes the host_id
of the replacee. It will continue to use a random,
unique host_id.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the local host_id is never changed externally
(by the storage_service upon replace-node),
the method can be made private and be used only for initializing the
local host_id to a random one.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to always be able to distinguish between
the replacing node and the replacee by using different,
unique, host identifiers.
This will allow us to use the host_id authoritatively
to identify the node (rather then its endpoint ip address)
for token mapping and node operations.
Also, it will be used in the following patch to never allow the
replaced node to rejoin the cluster, as its host_id should never
be reused.
Refs #9839
Refs #12040
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the header includes storage_proxy.hh and spreads this over the
code via raft_group0_client.hh -> group0_state_machine.hh -> lang.hh
Forward declaring proxy class it eliminates ~100 indirect dependencies on
storage_proxy.hh via this chain.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12241
The latter is pretty popular test/lib header that disseminates the
former one over whole lot of unit tests. The former, in turn, naturally
includes sstables.hh thus making tons of unrelated tests depend on
sstables class unused by them.
However, simple removal doesn't work, becase of local_shard_only bool
class definition in sstable_utils.hh used in simple_schema.hh. This
thing, in turn, is used in keys making helpers that don't belong to
sstable utils, so these are moved into simple_schema as well.
When done, this affects the mutation_source_test.hh, which needs the
local_shard_only bool class (and helps spreading the sstables.hh
throughout more unrelated tests) and a bunch of .cc test sources that
used sstable_utils.hh to indirectly include various headers of their
demand.
After patching, sstables.hh touches 2x times less tests. As a side
effect the sstables_manager.hh also becomes 2x times less dependent
on by tests.
Continuation of 9bdea110a6
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12240
trim_clustering_row_ranges_to() is broken for non-full keys in reverse
mode. It will trim the range to
position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.
Fixes#12180
Refs #1446
A frozen set can be part of the clustering key, and with compact
storage, the corresponding key component can have an empty value.
Comparison was not prepared for this, the iterator attempts to
deserialize the item count and will fail if the value is empty.
Fixes#12242
This pull request introduces support for global secondary indexes based on static columns.
Local secondary indexes based on secondary columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require full partition key when querying, such indexes wouldn't be very useful and would only waste resources.
The index table for secondary indexes on static columns, unlike other secondary indexes, do not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would only be unnecessary.
The already existing logic for querying using secondary indexes works after introducing minimal notifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed to keep most of the logic intact.
New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed.
Materialized view with static rows as a key are __not__ implemented in this PR.
Fixes: #2963Closes#11166
* github.com:scylladb/scylladb:
test_materialized_view: verify that static columns are not allowed
test_secondary_index: add (currently failing) test for static index paging
test_secondary_index: add more tests for secondary indexes on static columns
cassandra_tests: enable existing tests for static columns
create_index_statement: lift restriction on secondary indexes on static rows
db/view: fetch and process static rows when building indexes
gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
create_index_statement: disallow creation of local indexes with static columns
select_statement: prepare paging for indexes on static columns
select_statement: do not attempt to fetch clustering columns from secondary index's table
secondary_index_manager: don't add clustering key columns to index table of static column index
replica/table: adjust the view read-before-write to return static rows when needed
db/view: process static rows in view_update_builder::on_results
db/view: adjust existing view update generation path to use clustering_or_static_row
column_computation: adjust to use clustering_or_static_row
db/view: add clustering_or_static_row
deletable_row: add column_kind parameter to is_live
view_info: adjust view_column to accept column_kind
db/view: base_dependent_view_info: split non-pk columns into regular and static
abseil::hash depends on abseil::city and declareds CityHash32
as an external symbol. The city library static library, however,
precedes hash in the link list, which apparently makes the linker
simply drop it from the object list, since its symbols are not
used elsewhere.
Fix the linker ordering to help linker see that CityHash32
is used.
Closes#12231
Adds a test which verifies that static columns are not allowed in
materialized views. Although we added support for static columns in
secondary indexes, which share a lot of code with materialized views,
static columns in materialized views are not yet ready to use.
Currently, when executing queries accelerated by an index on a static
column, paging is unable to break base table partitions across pages and
is forced to return them in whole. This will cause problems if such a
query must return a very large base table partition because it will have
to be loaded into memory.
Fixing this issue will require a more sophisticated approach than what
was done in the PR. For the time being, an xfailing pytest is added
which should start passing after paging is improved.
This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide.
In addition, I have:
- updated the KB article to include the versions affected by that change.
- fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported).
Related: https://github.com/scylladb/scylladb/pull/11122Closes#12148
* github.com:scylladb/scylladb:
doc: update the releases in the KB about updating the mode after upgrade
doc: fix the broken link in the 5.1 upgrade guide
doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide
The reason is alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside coroutine object stored in the update
variable in row_cache::do_update()
It is allocated under cache region, but in case of exception it will
be destroyed under the standard allocator. If update is successful, it
will be cleared under region allocator, so there is not problem in the
normal case.
Fixes#12068Closes#12233
2.3 and 2018.1 ended their life and are long gone.
No need to have instructions for them in the master version of this
document.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"use `nodetool cleanup` cleanup command" repeats words, change to
"run the `nodetool cleanup` command".
Also, improve the description of the cleanup action
and how it relate to the bootstrapping process.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fix the phrase referring to it as a table respectively.
Also, do some minor phrasing touch-ups in this area.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Authetication status isn't gathered from scylla.yaml,
only the authenticator, so change the caption respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Writing `cat X | grep Y` is both inefficient and somewhat
unprofessional. The grep command works very well on a file argument
so `grep Y X` will do the job perfectly without the need for a pipe.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Typically data will be streamed from multiple existing nodes
to the new node, not from a single one.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to obtain the Raft ID of the replaced node during the shadow round and
place it in the address map. It won't be placed by the regular gossiping route
if we're replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it is not
guaranteed that background gossiping manages update the address map before we
need it, especially in tests where we set ring_delay to 0 and disable
wait_for_gossip_to_settle. The shadow round, on the other hand, performs a
synchronous request (and if it fails during bootstrap, bootstrap will fail -
because we also won't be able to obtain the tokens and Host ID of the replaced
node).
Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
Also remove an unconditional 60 seconds sleep from the replace code. Make it
dependent on ring_delay.
Enable the replace tests.
Modify some code related to removing servers from group 0 which depended on
storing IP addresses in the group 0 configuration.
Closes#12172
* github.com:scylladb/scylladb:
test/topology: enable replace tests
service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
service: handle replace correctly with Raft enabled
gms/gossiper: fetch RAFT_SERVER_ID during shadow round
service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
This change removes sstables.hh from some other headers replacing it
with version.hh and shared_sstable.hh. Also this drops
sstables_manager.hh from some more headers, because this header
propagates sstables.hh via self. That change is pretty straightforward,
but has a recochet in database.hh that needs disk-error-handler.hh.
Without the patch touch sstables/sstable.hh results in 409 targets
recompillation, with the patch -- 299 targets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12222
Tidy up namespaces, move code to the right file, and
move the whole thing to the replica module where it
belongs.
Closes#12219
* github.com:scylladb/scylladb:
dirty_memory_manager: move implementaton from database.cc
dirty_memory_manager: move to replica module
test: dirty_memory_manager_test: disambiguate classes named 'test_region_group'
dirty_memory_manager: stop using using namespace
There are two similarly named classes: ::test_region_group and
dirty_memory_manager_logalloc::test_region_group. Rename the
former to ::raii_region_group (that's what it's for) and the
latter to ::test_region_group, to reduce confusion.
Serializes the value that is an instance of a type. The opposite of `deserialize` (previously known as `print`).
All other actions operate on serialized values, yet up to now we were missing a way to go from human readable values to serialized ones. This prevented for example using `scylla types tokenof $pk` if one only had the human readable key value.
Example:
```
$ scylla types serialize -t Int32Type -- -1286905132
b34b62d4
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b 16
0010d00819896f6b11ea00000000001c571b000400000010
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b
0010d00819896f6b11ea00000000001c571b
```
Closes#12029
* github.com:scylladb/scylladb:
docs: scylla-types.rst: add mention of per-operation --help
tools/scylla-types: add serialize operation
tools/scylla-types: prepare for action handlers with string arguments
tools/scylla-types: s/print/deserialize/ operation
docs: scylla-types.rst: document tokenof and shardof
docs: scylla-types.rst: fix typo in compare operation description
Update the CODEOWNERS file with some people who joined different parts
of the project, and one person that left.
Note that despite is name, CODEOWNERS does not list "ownership" in any
strict sense of the word - it is more about who is willing and/or
knowledgeable enough to participate in reviewing changes to particular
files or directories. Github uses this file to automatically suggest
who should review a pull request.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12216
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire shared lock on the partition
and any lock on the row. both cases currently use a fiber
returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a when_all_succeeds continuation in lock_ck runs. Since
the lock_partition future failed, it destroyes both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above,
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its rows locks is held.
Fixes#12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12208
The diagnostics dumped by the reader concurrency semaphore are pretty
common-sight in logs, as soon as a node becomes problematic. The reason
is that the reader concurrency semaphore acts as the canary in the coal
mine: it is the first that starts screaming when the node or workload is
unhealthy. This patch adds documentation of the content of the
diagnostics and how to diagnose common problems based on it.
Fixes: #10471Closes#11970
Takes human readable values and converts them to serialized hex encoded
format. Only regular atomic types are supported for now, no
collection/UDT/tuple support, not even in frozen form.
Currently all action handlers have bytes arguments, parsed from
hexadecimal string representations. We plan on adding a serialize
command which will require raw string arguments. Prepare the
infrastructure for supporting both types of action handlers.
Soon we will have a serialize operation. Rename the current print
operation to deserialize in preparation to that. We want the two
operations (serialize and deserialize) to reflect their relation in
their names too.
Secondary indexes on static columns should work now. This commit lifts
the existing restriction after the cluster is fully upgraded to a
version which supports such indexes.
This commit modifies the view builder and its consumer so that static
rows are always fetched and properly processed during view build.
Currently, the view builder will always fetch both static and clustering
rows, regardless of the type of indexes being built. For indexes on
static columns this is wasteful and could be improved so that only the
types of rows relevant to indexes being built are fetched - however,
doing this sounds a bit complicated and I would rather start with
something simpler which has a better chance of working.
Local indexes on static columns don't make sense because there is only
one static row per partition. It's always better to just run SELECT
DISTINCT on the base table. Allowing for such an index would only make
such queries slower (due to double lookup), would take unnecessary space
and could pose potential consistency problems, so this commit explicitly
forbids them.
When performing a query on a table which is accelerated by a secondary
index, the paging state returned along with the query contains a
partition key and a clustering key of the secondary index table. The
logic wasn't prepared to handle the case of secondary indexes on static
columns - notably, it tried to put base table's clustering key columns
into the paging state which caused problems in other places.
This commit fixes the paging logic so that the PK and CK of a secondary
index table is calculated correctly. However, this solution has a major
drawback: because it is impossible to encode clustering key of the base
table in the paging state, partitions returned by queries accelerated by
secondary indexes on static columns will _not_ be split by paging. This
can be problematic in case there are large partitions in the base table.
The main advantage of this fix is that it is simple. Moreover, the
problem described above is not unique to static column indexes, but also
happens e.g. in case of some indexes on clustering columns (see case 2
of scylladb/scylla#7432). Fixing this issue will require a more
sophisticated solution and may affect more than only secondary indexes
on static columns, so this is left for a followup.
The previous commit made sure that the index table for secondary indexes
on static tables don't have columns corresponding to clustering rows in
the base table - therefore, we must make sure that we don't try to fetch
them when querying the index table.
The implementation of secondary indexes on static columns relies on the
fact that the index table only includes partition key columns of the
base table, but not clustering key columns. A static column's value
determines a set of full partitions, so including the clustering key
would only be redundant. It would also generate more work as a single
static column update would require a large portion of the index to be
updated.
This commit makes sure that clustering columns are not included in the
index table for indexes based on a static column.
Adjusts the read-before-write query issued in
`table::do_push_view_replica_updates` so that, when needed, requests
static columns and makes sure that the static row is present.
The `view_update_builder::on_results()` function is changed to react to
static rows when comparing read-before-write results with the base table
mutation.
Adjusts the column_computation interface so that it is able to accept
both clustering and static rows through the common
db::view::clustering_or_static_row interface.
Adds a `clustering_or_static_row`, which is a common, immutable
representation of either a static or clustering row. It will allow to
handle view update generation based on static or clustering rows in a
uniform way.
While deletable_row is used to hold regular columns of a clustering row,
its name or implementation doesn't suggest that it is a requirement. In
fact, some of its methods already take a column_kind parameter which is
used to interpret the kind of columns held in the row.
This commit removes the assumption about the column kind from the
`deletable_row::is_live` method.
The `view_info::view_column()` and `view_column` in view.cc allow to get
a view's column definition which corresponds to given base table's
column. They currently assume that the given column id corresponds to a
regular column. In preparation for secondary indexes based on static
columns, this commit adjusts those functions so that they accept other
kinds of columns, including static columns.
Currently, `base_dependent_view_info::_base_non_pk_columns_in_view_pk`
field keeps a list of non-primary-key columns from the base table which
are a part of the view's primary key. Because the current code does not
allow indexes on static columns yet, the columns kept in the
aforementioned field are always assumed to be regular columns of the
base table and are kept as `column_id`s which do not contain information
about the column kind.
This commit splits the `_base_non_pk_columns_in_view_pk` field into two,
one for regular columns and the other for static columns, so that it is
possible to keep both kinds of columns in `base_dependent_view_info` and
the structure can be used for secondary indexes on static columns.
Three lambdas were removed, simplifying the code.
Closes#12207
* github.com:scylladb/scylladb:
compaction_manager: reindent postponed_compactions_reevaluation()
compaction_manager: coroutinize postponed_compactions_reevaluation()
compaction_manager: make postponed_compactions_reevaluation() return a future
The first patch in this small series fixes a hang during shutdown when the expired-item scanning thread can hang in a retry loop instead of quitting. These hangs were seen in some test runs (issue #12145).
The second patch is a failsafe against additional bugs like those solved by the first patch: If any bugs causes the same page fetch to repeatedly time out, let's stop the attempts after 10 retries instead of retrying for ever. When we stop the retries, a warning will be printed to the log, Scylla will wait until the next scan period and start a new scan from scratch - from a random position in the database, instead of hanging potentially-forever waiting for the same page.
Closes#12152
* github.com:scylladb/scylladb:
alternator ttl: in scanning thread, don't retry the same page too many times
alternator: fix hang during shutdown of expiration-scanning thread
As it has a do_with(), coroutinizing it is an automatic win.
Closes#12195
* github.com:scylladb/scylladb:
cql3: batch_statement: reindent get_mutations()
cql3: batch_statement: coroutinize get_mutations()
postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.
Make it return a future instead of stashing its completion somewhere.
This makes is easier to convert it to a coroutine.
* abseil 7f3c0d78...4e5ff155 (125):
> Add a compilation test for recursive hash map types
> Add AbslStringify support for enum types in Substitute.
> Use a c++14-style constexpr initialization if c++14 constexpr is available.
> Move the vtable into a function to delay instantiation until the function is called. When the variable is a global the compiler is allowed to instantiate it more aggresively and it might happen before the types involved are complete. When it is inside a function the compiler can't instantiate it until after the functions are called.
> Cosmetic reformatting in a test.
> Reorder base64 unescape methods to be below the escaping methods.
> Fixes many compilation issues that come from having no external CI coverage of the accelerated CRC implementation and some differences bewteen the internal and external implementation.
> Remove static initializer from mutex.h.
> Import of CCTZ from GitHub.
> Remove unused iostream include from crc32c.h
> Fix MSVC builds that reject C-style arrays of size 0
> Remove deprecated use of absl::ToCrc32c()
> CRC: Make crc32c_t as a class for explicit control of operators
> Convert the full parser into constexpr now that Abseil requires C++14, and use this parser for the static checker. This fixes some outstanding bugs where the static checker differed from the dynamic one. Also, fix `%v` to be accepted with POSIX syntax.
> Write (more) directly into the structured buffer from StringifySink, including for (size_t, char) overload.
> Avoid using the non-portable type __m128i_u.
> Reduce flat_hash_{set,map} generated code size.
> Use ABSL_HAVE_BUILTIN to fix -Wundef __has_builtin warning
> Add a TODO for the deprecation of absl::aligned_storage_t
> TSAN: Remove report_atomic_races=0 from CI now that it has been fixed
> absl: fix Mutex TSan annotations
> CMake: Remove trailing commas in `AbseilDll.cmake`
> Fix AMD cpu detection.
> CRC: Get CPU detection and hardware acceleration working on MSVC x86(_64)
> Removing trailing period that can confuse a url in str_format.h.
> Refactor btree iterator generation code into a base class rather than using ifdefs inside btree_iterator.
> container.h: fix incorrect comments about the location of <numeric> algorithms.
> Zero encoded_remaining when a string field doesn't fit, so that we don't leave partial data in the buffer (all decoders should ignore it anyway) and to be sure that we don't try to put any subsequent operands in either (there shouldn't be enough space).
> Improve error messages when comparing btree iterators when generations are enabled.
> Document the WebSafe* and *WithPadding variants more concisely, as deltas from Base64Encode.
> Drop outdated comment about LogEntry copyability.
> Release structured logging.
> Minor formatting changes in preparation for structured logging...
> Update absl::make_unique to reflect the C++14 minimum
> Update Condition to allocate 24 bytes for MSVC platform pointers to methods.
> Add missing include
> Refactor "RAW: " prefix formatting into FormatLogPrefix.
> Minor formatting changes due to internal refactoring
> Fix typos
> Add a new API for `extract_and_get_next()` in b-tree that returns both the extracted node and an iterator to the next element in the container.
> Use AnyInvocable in internal thread_pool
> Remove absl/time/internal/zoneinfo.inc. It was used to guarantee availability of a few timezones for "time_test" and "time_benchmark", but (file-based) zoneinfo is now secured via existing Bazel data/env attributes, or new CMake environment settings.
> Updated documentation on use of %v Also updated documentation around FormatSink and PutPaddedString
> Use the correct Bazel copts in crc targets
> Run the //absl/time timezone tests with a data dependency on, and a matching ${TZDIR} setting for, //absl/time/internal/cctz:zoneinfo.
> Stop unnecessary clearing of fields in ~raw_hash_set.
> Fix throw_delegate_test when using libc++ with shared libraries
> CRC: Ensure SupportsArmCRC32PMULL() is defined
> Improve error messages when comparing btree iterators.
> Refactor the throw_delegate test into separate test cases
> Replace std::atomic_flag with std::atomic<bool> to avoid the C++20 deprecation of ATOMIC_FLAG_INIT.
> Add support for enum types with AbslStringify
> Release the CRC library
> Improve error messages when comparing swisstable iterators.
> Auto increase inlined capacity whenever it does not affect class' size.
> drop an unused dep
> Factor out the internal helper AppendTruncated, which is used and redefined in a couple places, plus several more that have yet to be released.
> Fix some invalid iterator bugs in btree_test.cc for multi{set,map} emplace{_hint} tests.
> Force a conservative allocation for pointers to methods in Condition objects.
> Fix a few lint findings in flags' usage.cc
> Narrow some _MSC_VER checks to not catch clang-cl.
> Small cleanups in logging test helpers
> Import of CCTZ from GitHub.
> Merge pull request abseil/abseil-cpp#1287 from GOGOYAO:patch-1
> Merge pull request abseil/abseil-cpp#1307 from KindDragon:patch-1
> Stop disabling some test warnings that have been fixed
> Support logging of user-defined types that implement `AbslStringify()`
> Eliminate span_internal::Min in favor of std::min, since Min conflicts with a macro in a third-party library.
> Fix -Wimplicit-int-conversion.
> Improve error messages when dereferencing invalid swisstable iterators.
> Cord: Avoid leaking a node if SetExpectedChecksum() is called on an empty cord twice in a row.
> Add a warning about extract invalidating iterators (not just the iterator of the element being extracted).
> CMake: installed artifacts reflect the compiled ABI
> Import of CCTZ from GitHub.
> Import of CCTZ from GitHub.
> Support empty Cords with an expected checksum
> Move internal details from one source file to another more appropriate source file.
> Removes `PutPaddedString()` function
> Return uint8_t from CappedDamerauLevenshteinDistance.
> Remove the unknown CMAKE_SYSTEM_PROCESSOR warning when configuring ABSL_RANDOM_RANDEN_COPTS
> Enforce Visual Studio 2017 (MSVC++ 15.0) minumum
> `absl::InlinedVector::swap` supports non-assignable types.
> Improve b-tree error messages when dereferencing invalid iterators.
> Mutex: Fix stall on single-core systems
> Document Base64Unescape() padding
> Fix sign conversion warnings in memory_test.cc.
> Fix a sign conversion warning.
> Fix a truncation warning on Windows 64-bit.
> Use btree iterator subtraction instead of std::distance in erase_range() and count().
> Eliminate use of internal interfaces and make the test portable and expose it to OSS.
> Fix various warnings for _WIN32.
> Disables StderrKnobsDefault due to order dependency
> Implement btree_iterator::operator-, which is faster than std::distance for btree iterators.
> Merge pull request abseil/abseil-cpp#1298 from rpjohnst:mingw-cmake-build
> Implement function to calculate Damerau-Levenshtein distance between two strings.
> Change per_thread_sem_test from size medium to size large.
> Support stringification of user-defined types in AbslStringify in absl::Substitute.
> Fix "unsafe narrowing" warnings in absl, 12/12.
> Revert change to internal 'Rep', this causes issues for gdb
> Reorganize InlineData into an inner Rep structure.
> Remove internal `VLOG_xxx` macros
> Import of CCTZ from GitHub.
> `absl::InlinedVector` supports move assignment with non-assignable types.
> Change Cord internal layout, which reduces store-load penalties on ARM
> Detects accidental multiple invocations of AnyInvocable<R(...)&&>::operator()&& by producing an error in debug mode, and clarifies that the behavior is undefined in the general case.
> Fix a bug in StrFormat. This issue would have been caught by any compile-time checking but can happen for incorrect formats parsed via ParsedFormat::New. Specifically, if a user were to add length modifiers with 'v', for example the incorrect format string "%hv", the ParsedFormat would incorrectly be allowed.
> Adds documentation for stringification extension
> CMake: Remove check_target calls which can be problematic in case of dependency cycle
> Changes mutex unlock profiling
> Add static_cast<void*> to the sources for trivial relocations to avoid spurious -Wdynamic-class-memaccess errors in the presence of other compilation errors.
> Configure ABSL_CACHE_ALIGNED for clang-like and MSVC toolchains.
> Fix "unsafe narrowing" warnings in absl, 11/n.
> Eliminate use of internal interfaces
> Merge pull request abseil/abseil-cpp#1289 from keith:ks/fix-more-clang-deprecated-builtins
> Merge pull request abseil/abseil-cpp#1285 from jun-sheaf:patch-1
> Delete LogEntry's copy ctor and assignment operator.
> Make sinks provided to `AbslStringify()` usable with `absl::Format()`.
> Cast unused variable to void
> No changes in OSS.
> No changes in OSS
> Replace the kPower10ExponentTable array with a formula.
> CMake: Mark absl::cord_test_helpers and absl::spy_hash_state PUBLIC
> Use trivial relocation for transfers in swisstable and b-tree.
> Merge pull request abseil/abseil-cpp#1284 from t0ny-peng:chore/remove-unused-class-in-variant
> Removes the legacy spellings of the thread annotation macros/functions by default.
Closes#12201
The cql server uses an execution stage to process and execute queries,
however, processing stage is best utilized when having a recurrent flow
that needs to be called repeatedly since it better utilizes the
instruction cache.
Up until now, every request was sent through the processing stage, but
most requests are not meant to be executed repeatedly with high volume.
This change processes and executes the data queries asynchronously,
through an execution stage, and all of the rest are processed one by
one, only continuing once the request has been done end to end.
Tests:
Unit tests in dev and debug.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes#12202
This PR hits two goals for "object storage" effort
1. Sstables loader "knows" that sstables components are stored in a Linux directory and uses utils/lister to access it. This is not going to work with sstables over object storage, the loader should be abstracted from the underlying storage.
2. Currently class keyspace and class column_family carry "datadir" and "all_datadirs" on board which are path on local filesystem where sstable files are stored (those usually started with /var/lib/scylla/data). The paths include subsdirs like "snapshots", "staging", etc. This is not going to look nice for obejct storage, the /var/lib/ prefix is excessive and meaningless in this case. Instead, ks and cf should know their "location" and some other component should know the directory where in which the files are stored.
Said that, this PR prepares distributed_loader and sstables_directly to stop using Linux paths explicitly by making both call sstables_manager to list and open sstables object. After it will be possible to teach manager to list sstables from object storage. Also this opens the way to removing paths from keyspace and column_family classes and replacing those with relative "location"s.
Closes#12128
* github.com:scylladb/scylladb:
sstable_directory: Get components lister from manager
sstable_directory: Extract directory lister
sstable_directory: Remove sstable creation callback
sstable_directory: Call manager to make sstables
sstable_directory: Keep error handler generator
sstable_directory: Keep schema_ptr
sstable_directory: Use directory semaphore from manager
sstable_directory: Keep reference on manager
tests: Use sstables creation helper in some cases
sstables_manager: Keep directory semaphore reference
sstables, code: Wrap directory semaphore with concurrency
Currently if a node that is outside of the config tries to add an entry
or modify config transient error is returned and this causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above it means it was part of the cluster at some point,
but since a node with the same id should not be added back to a cluster
if it is not in the cluster now it will never be.
Return a new error not_a_member to a caller instead.
Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
We currently try to detect a replaced node so to insert it to
endpoints_to_remove when it has no owned tokens left.
However, for each token we first generate a multimap using
get_endpoint_to_token_map_for_reading().
There are 2 problems with that:
1. unless the replaced node owns a single token, this map will not
be empty after erasing one token out of it, since the
token metadata has not changed yet (this is done later with
update_normal_tokens(owned_tokens, endpoint)).
2. generating this map for each token is inefficient, turning this
algorithm complexity to quadratic in the number of tokens...
This change copies the current token_to_endpoint map
to temporary map and erases replaced tokens from it,
while maintaining a set of candidates_for_removal.
After traversing all replaced tokens, we check again
the `token_to_endpoint_map` erasing from `candidates_for_removal`
any endpoint that still owns tokens.
The leftover candidates are endpoints the own no tokens
and so they are added to `hosts_to_remove`.
Fixes#12082
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12141
Also simplify the code and improve logging in general.
The previous code did this: search for the ID in the address map. If it
couldn't be found, perform a read barrier and search again. If it again
couldn't be found, return.
This algorithm depended on the fact that IP addresses were stored in
group 0 configuration. The read barrier was used to obtain the most
recent configuration, and if the IP was not a part of address map after
the read barrier, that meant it's simply not a member of group 0.
This logic no longer applies so we can simplify the code.
Furthermore, when I was fixing the replace operation with Raft enabled,
at some point I had a "working" solution with all tests passing. But I
was suspicious and checked if the replaced node got removed from
group 0. It wasn't. So the replace finished "successfully", but we had
an additional (voting!) member of group 0 which didn't correspond to
a token ring member.
The last version of my fixes ensure that the node gets removed by the
replacing node. But the system is fragile and nothing prevents us from
breaking this again. At least log an error for now. Regression tests
will be added later.
We must place the Raft ID obtained during the shadow round in the
address map. It won't be placed by the regular gossiping route if we're
replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it
is not guaranteed that background gossiping manages to update the
address map before we need it, especially in tests where we set
ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round,
on the other hand, performs a synchronous request (and if it fails
during bootstrap, bootstrap will fail - because we also won't be able to
obtain the tokens and Host ID of the replaced node).
Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
During the replace operation we need the Raft ID of the replaced node.
The shadow round is used for fetching all necessary information before
the replace operation starts.
Most of the sleeps related to gossiping are based on `ring_delay`,
which is configurable and can be set to lower value e.g. during tests.
But for some reason there was one case where we slept for a hardcoded
value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds.
Use `2 * get_ring_delay()` instead. With the default value of
`ring_delay` (30 seconds) this will give the same behavior.
For now this is almost a no-op because manager just calls
sstables_directory code back to create the lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the utils/lister.cc code is in use to list regular files in a
directory. This patch wraps the lister into more abstract components
lister class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the directory code has everyhting it needs to create sstable object
and can stop using the external lambda.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Yet another continuation to previous patch -- IO error handlers
generator is also needed to create sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of one-before-previous patch. In order to create sstable
without external lambda the directory code needs schema.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patch sstables_directory code may no longer require for
semaphore argument, because it can get one from manager. This makes the
directory API shorter and simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstables_directly accesses /var/lib/scylla/data in two ways -- lists
files in it and opens sstables. The latter is abdtracted with the help
of lambdas passed around, but the former (listing) is done by using
directory liters from utils.
Listing sstables components with directlry lister won't work for object
storage, the directory code will need to call some abstraction layer
instead. Opening sstables with the help of a lambda is a bit of
overkill, having sstables manager at hand could make it much simpler.
Said that, this patch makes sstables_directly reference sstables_manager
on start.
This change will also simplify directory semaphore usage (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several test cases push sstables creation lambda into
with_sstables_directory helper. There's a ready to use helper class that
does the same. Next patch will make additional use of that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelizm.
This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When repair master and followers have different shard count, the repair
followers need to create multi-shard readers. Each multi-shard reader
will create one local reader on each shard, N (smp::count) local readers
in total.
There is a hard limit on the number of readers who can work in parallel.
When there are more readers than this limit. The readers will start to
evict each other, causing buffers already read from disk to be dropped
and recreating of readers, which is not very efficient.
To optimize and reduce reader eviction overhead, a global reader permit
is introduced which considers the multi-shard reader bloats.
With this patch, at any point in time, the number of readers created by
repair will not exceed the reader limit.
Test Results:
1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2
shards, n2=8 shards, memory wanted =1
1.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:45:24,770] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:45:53,869] Repair session 1
[2022-11-23 17:45:53,869] Repair session 1 finished
real 0m30.212s
user 0m1.680s
sys 0m0.222s
1.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:46:07,507] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:46:30,608] Repair session 1
[2022-11-23 17:46:30,608] Repair session 1 finished
real 0m24.241s
user 0m1.731s
sys 0m0.213s
2) with stream sem 10, repair global sem no_limit, 5 ranges in
parallel, n1=2 shards, n2=8 shards, memory wanted =1
2.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:49:49,301] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:01,414] Repair session 1
[2022-11-23 17:52:01,415] Repair session 1 finished
real 2m13.227s
user 0m1.752s
sys 0m0.218s
2.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:52:19,280] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:42,387] Repair session 1
[2022-11-23 17:52:42,387] Repair session 1 finished
real 0m24.196s
user 0m1.689s
sys 0m0.184s
Comparing 1.1) and 2.1), it shows the eviction played a major role here.
The patch gives 73s / 30s = 2.5X speed up in this setup.
Comparing 1.1 and 1.2, it shows even if we limit the readers, starting
on the lower shard is faster 30s / 24s = 1.25X (the total number of
multishard readers is lower)
Fixes#12157Closes#12158
Split the simple (and common) case from the complex case,
and coroutinize the latter. Hopefully this generates better
code for the simple case, and it makes the complex case a
little nicer.
Closes#12194
* github.com:scylladb/scylladb:
cql3: select_statement: reindent process_results_complex()
cql3: select_statement: coroutinize process_results_complex()
cql3: select_statement: split process_results() into fast path and complex path
run_snapshot_list_operation() takes a continuation, so passing it
a lambda coroutine without protection is dangerous.
Protect the coroutine with coroutine::lambda so it doesn't lost its
contents.
Fixes#12192.
Closes#12193
Not a huge gain, since it's just a do_with, but still a little better.
Note the inner lambda is not a coroutine, so isn't susceptibe to
the lambda coroutine fiasco.
One of the prerequisites to make sstables reside on object-storage is not to let the rest of the code "know" the filesystem path they are located on (because sometimes they will not be on any filesystem path). This patch makes the methods that can reveal this path back private so that later they can be abstracted out.
Closes#12182
* github.com:scylladb/scylladb:
sstable: Mark some methods private
test: Don't get sstable dir when known
test: Use move_to_quarantine() helper
test: Use sstable::filename() overload without dir name
sstables: Reimplement batch directory sync after move
table, tests: Make use of move_to_new_dir() default arg
sstables: Remove fsync_directory() helper
table: Simplify take_snapshot()'s collecting sstables names
The test enables an error injection inside the Raft upgrade procedure
on one of the nodes which will cause the node to throw an exception
before entering `synchronize` state. Then it restarts other nodes with
Raft enabled, waits until they enter `synchronize` state, puts them in
RECOVERY mode, removes the error-injected node and creates a new Raft
group 0.
As soon as the other nodes enter `synchronize`, the test disabled the
error injection (the rest of the test was outside the `async with
inject_error(...)` block). There was a small chance that we disabled the
error injection before the node reached it. In that case the node also
entered `synchronize` and the cluster managed to finish the upgrade
procedure. We encountered this during next promotion.
Eliminate this possibility by extending the scope of the `async with
inject_error(...)` block, so that the RECOVERY mode steps on the other
nodes are performed within that block.
Closes#12162
There are several class sstable methods that reveal internal directory
path to caller. It's not object-storage-friendly. Fortunately, all the
callers of those methods had been patched not to work with full paths,
so these can be marked private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable_move_test creates sstables in its own temp directories and
the requests these dirs' paths back from sstables. Test can come with
the paths it has at hand, no need to call sstables for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places in tests move sstable to quarantine subdir by hand. There's
the class sstable method that does the same, so use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dir this place currently uses is the directory where the sstable was
created, so dropping this argument would just render the same path.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a table::move_sstables_from_staging() method that gets a bunch
of sstables and moves them from staging subdit into table's root
datadir. Not to flush the root dir for every sstable move, it asks the
sstable::move_to_new_dir() not to flush, but collects staging dir names
and flushes them and the root dir at the end altothether.
In order to make it more friendly to object-storage and to remove one
more caller of sstable::get_dir() the delayed_commit_changes struct is
introduced. It collects _all_ the affected dir names in unordered_set,
then allows flushing them. By default the move_to_new_dir() doesn't
receive this object and flushes the directories instantly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts boolean bit whether or not it should sync
directories at the end. It's always true but in one case, so there's the
default value for it. Make use of it.
Anticipating the suggestion to replace bool with bool_class -- next
patch will replace it with something else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one effectively wraps existing seastar sync_directory() helper into
two io_check-s. It's simpler just to call the latter directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question "snapshots" all sstables it can find, then writes
their Datafile names into the manifest file. To get the list of file
names it iterates over sstables list again and does silly conversion of
full file path to file name with the help of the directory path length.
This all can be made much simpler if just collecting component names
directly at the time sstable is hardlinked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_state shouldn't be moved once emplaced. moving it could
theoretically cause task's gate holder to have a dangling pointer to
compaction_state's gate, but turns out gate's move ctor will actually
fail under this assertion:
assert(!_count && "gate reassigned with outstanding requests");
Cannot happen today, but let's make it more future proof.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#12167
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Closes#12107
* github.com:scylladb/scylladb:
service/raft: specialized verb for failure detector pinger
db: system_keyspace: de-staticize `{get,set}_raft_server_id`
service/raft: make this node's Raft ID available early in group registry
According to seastar/doc/lambda-coroutine-fiasco.md lambda that
co_awaits once loses its capture frame. In distrobuted_loader
code there's at least one of that kind.
fixes: #12175
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12170
The method became unused since 70e5252a (table: no longer accept online
loading of SSTable files in the main directory) and the whole concept of
reshuffling sstables was dropped later by 7351db7c (Reshape upload files
and reshard+reshape at boot).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12165
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.
Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.
We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.
This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.
Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened #12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.
Note that although the test will now be marked "xfail", it will actually
pass most of the time, so will appear as "xpass" to people run it.
I put a note in the xfail reason string as a reminder why this is
happening.
Fixes#12143
Refs #12166
Refs scylladb/seastar#1325
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12169
The field was not used for anything. We can keep decommissioned server
in `stopped` field.
In fact it caused us a problem: since recently, we're using
`ScyllaCluster.uninstall` to clean-up servers after test suite finishes
(previously we were using `ScyllaServer.uninstall` directly). But
`ScyllaCluster.uninstall` didn't look into the `decommissioned` field,
so if a server got decommissioned, we wouldn't uninstall it, and it left
us some unnecessary artifacts even for successful tests. This is now
fixed.
Closes#12163
Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain.
Closes#12155
* github.com:scylladb/scylladb:
test, utils: Use only one tempdir
sstable_compaction_test: Dont create nested envs
mutation_reader_test: Remove unused create_sstable() helper
tests, lib: Move globals onto sstables::test_env
tests: Use sstables::test_env.db_config() to access config
features: Mark feature_config_from_db_config const
sstable_3_x_test: Use env method to create sst
sstable_3_x_test: Indentation fix after previous patch
sstable_3_x_test: Use sstable::test_env
test: Add config to sstable::test_env creation
config: Add constexpr value for default murmur ignore bits
There's a do_with_cloned_tmp_directory that makes two temp dirs to toss
sstables between them. Make it go with just one, all the more so it
would resemble existing manipulations aroung staging/ subdir
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The "compact" test case runs in sstables::test_env and additionally
wraps it with another instance provided by do_with_tmp_directory helper.
It's simpler to create the temp dir by hand and use outter env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of objects that are used by test_env as sstables_manager
dependencies. Now when no other code needs those globals they better sit
on the test_env next to the manager
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently some places use global test config, but it's going to be
removed soon, so switch to using config from environment
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's in fact such. Other than that, next patch will call it with const
config at hand and fail to compile without this fix
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several cases there that construct sstables_manager by hand
with the help of a bunch of global dependencies. It's nicer to use
existing wrapper.
(indentation left broken until next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To make callers (tests) construct it with different options. In
particular, one test will soon want to construct it with custom large
data handler of its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The PR introduces shard_repair_task_impl which represents a repair task
that spans over a single shard repair.
repair_info is replaced with shard_repair_task_impl, since both serve
similar purpose.
Closes#12066
* github.com:scylladb/scylladb:
repair: reindent
repair: replace repair_info with shard_repair_task_impl
repair: move repair_info methods to shard_repair_task_impl
repair: rename methods of repair_module
repair: change type of repair_module::_repairs
repair: keep a reference to shard_repair_task_impl in row_level_repair
repair: move repair_range method to shard_repair_task_impl
repair: make do_repair_ranges a method of shard_repair_task_impl
repair: copy repair_info methods to shard_repair_task_impl
repair: corutinize shard task creation
repair: define run for shard_repair_task_impl
repair: add shard_repair_task_impl
Since fixing issue #11737, when the expiration scanner times out reading
a page of data, it retries asking for the same page instead of giving up
on the scan and starting anew later. This retry was infinite - which can
cause problems if we have a bug in the code or several nodes down, which
can lead to getting hung in the same place in the scan for a very long
(potentially infinite) time without making any progress.
An example of such a bug was issue #12145, where we forgot to handle
shutdowns, so on shutdown of the cluster we just hung forever repeating
the same request that will never succeed. It's better in this case to
just give up on the current scan, and start it anew (from a random
position) later.
Refs #12145 (that issue was already fixed, by a different patch which
stops the iteration when shutting down - not waiting for an infinite
number of iterations and not even one more).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The expiration-scanning thread is a long-running thread which can scan
data for hours, but checks for its abort-source before fetching each
page to allow for timely shutdown. Recently, we added the ability to
retry the page fetching in case of timeout, for forgot to check the
abort source in this new retry loop - which lead to an infinitely-long
shutdown in some tests while the retry loop retries forever.
In this patch we fix this bug by using sleep_abortable() instead of
sleep(). sleep_abortable() will throw an exception if the abort source
was triggered before or during the sleep - and this exception will
stop the scan immediately.
Fixes#12145
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Some people prefer to cherry-pick individual commits
so that they have less conflicts to resolve at once.
Add a comment about this possibility.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This is the core of dynamic IP address support in Raft, moving out the
IP address sourcing from Raft Group 0 configuration to gossip. At start
of Raft, the raft id <> IP address translation map is tuned into the
gossiper notifications and learns IP addresses of Raft hosts from them.
The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.
Closes#12035
* github.com:scylladb/scylladb:
raft: (rpc) do not throw in case of a missing IP address in RPC
raft: (address map) actively maintain ip <-> raft server id map
The backport instructions said that after passing
the tests next `becomes` master, but it's more
exact to say that next `is merged into` master.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Previously the section was called:
"How to backport a patch", which could be interpreted
as instructions for the maintainer.
The new title clearly states that these instructions
are for the contributor in case the maintainer couldn't
backport the patch by themselves.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Since we switched scylla-machine-image locale to C.UTF-8 because
ubuntu-minimal image does not have en_US.UTF-8 by default, we should
do same on our docker image to reduce image size.
Verified #9570 does not occur on new image, since it is still UTF-8
locale.
Closes#12122
The site member is created in ScyllaCluster.start(), for startup failure
this might not be initialized, so check it's present before stop()ing
it. And delete it as it's not running and proper initialization should
call ScyllaCluster.start().
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11939
A question was raised on what fetch_size (the requested page size
in a paged scan) counts when there is a filter: does it count the
rows before filtering (as scanned from disk) or after filter (as
will be returned to the client)?
This patch adds a test which demonstrates that Cassandra and Scylla
behave differently in this respect: Cassandra counts post-filtering -
so fetch_size results are actually returned, while Scylla currently
counts pre-filtering.
It is arguable which behavior is the "correct" one - we discuss this in
issue #12102. But we have already had several users (such as #11340)
who complained about Scylla's behavior and expected Cassandra's behavior,
so if we decide to keep Scylla's behavior we should at least explain and
justify this decision in our documentation. Until then, let's have this
test which reminds us of this incompatibility. This test currently passes
on Cassandra and fails (xfail) on Scylla.
Refs #11340
Refs #12102
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12103
This patch adds a regression test for the old issue #65 which is about
a multi-column (tuple) clustering-column relation in a SELECT when one
these columns has reversed order. It turns out that we didn't notice,
but this issue was already solved - but we didn't have a regression test
for it. So this patch adds just a regression test. The test confirms that
Scylla now behaves like was desired when that issue was opened. The test
also passes on Cassandra, confirming that Scylla and Cassandra behave
the same for such requests.
Fixes#65
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12130
Better logging, less code, a minor fix.
Closes#12135
* github.com:scylladb/scylladb:
service/raft: raft_group0: less repetitive logging calls
service/raft: raft_group0: fix sleep_with_exponential_backoff
Add instructions on how to backport a feature
to on older version of Scylla.
It contains a detailed step-by-step instruction
so that people unfamiliar with intricacies
of Scylla's repository organization can
easily get the hang of it.
This is the guide I wish I had when I had
to do my first backport.
I put it in backport.md because that
looks like the file responsible
for this sort of information.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Remove raft_address_map::get_inet_address()
While at it, coroutinize some rpc mehtods.
To propagate up the event of missing IP address, use coroutine::exception(
with a proper type (raft::transport_error) and a proper error message.
This is a building block from removing
raft_address_map::get_inet_address() which is too generic, and shifting
the responsibility of handling missing addresses to the address map
clients. E.g. one-way RPC shouldn't throw if an address is missing, but
just drop the message.
PS An attempt to use a single template function rendered to be too
complex:
- some functions require a gate, some don't
- some return void, some future<> and some future<raft::data_type>
1) make address map API flexible
Before this patch:
- having a mapping without an actual IP address was an
internal error
- not having a mapping for an IP address was an internal
error
- re-mapping to a new IP address wasn't allowed
After this patch:
- the address map may contain a mapping
without an actual IP address, and the caller must be prepared for it:
find() will return a nullopt. This happens when we first add an entry
to Raft configuration and only later learn its IP address, e.g. via
gossip.
- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications
Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.
3) prompt address map state with app state
Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).
The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.
Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.
Thanks to the changes above, Raft configuration no longer
stores IP addresses.
We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
Some log messages in retry loops in the Raft upgrade procedure included
a sentence like "sleeping before retrying..."; but not all of them.
With the recently added `sleep_with_exponential_backoff` abstraction we
can put this "sleeping..." message in a single place, and it's also easy
to say how long we're going to sleep.
I also enjoy using this `source_location` thing.
The SELECT JSON statement, just like SELECT, allows the user to rename
selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo".
This specification was not honored: We simply forgot to look at the
alias in SELECT JSON's implementation (we did it correctly in regular
SELECT). So this patch fixes this bug.
We had two tests in cassandra_tests/validation/entities/json_test.py
that reproduced this bug. The checks in those tests now pass, but these
two tests still continue to fail after this patch because of two other
unrelated bugs that were discovered by the same tests. So in this patch
I also add a new test just for this specific issue - to serve as a
regression test.
Fixes#8078
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12123
* seastar 4f4cc00660...3a5db04197 (16):
> tls: add missing include <map>
> Merge 'util/process: use then_unpack to help automatically unpack tuple.' from Jianyong Chen
> HTTP: define formatter for status_type to fix build.
> fsnotifier: move it into namespace experimental and add docs.
> Move fsnotify.hh to the 'include' directory for public use.
> Merge 'reactor: define make_pipe() and use make_pipe() in reactor::spawn()' from Kefu Chai
> Merge 'Fix: error when compiling http_client_demo' from Amossss
> util/process: using `data_sink_impl::put`
> Merge 'dns: serialize UDP sends.' from Calle Wilund
> build: use correct version when finding liburing
> Merge 'Add simple http client' from Pavel Emelyanov
> future: use invoke_result instead of nested requirements
> Merge 'reactor: use separate calls in reactor and reactor_backend for read/write/sendmsg/recvmsg' from Kefu Chai
> util, core: add spawn_process() helper
> parallel utils: add note about shard-local parallelism
> shared_mutex: return typed exceptional future in with_* error handlers
Closes#12131
Some of the tests in test/alternator/test_ttl.py need an expiration scan
pass to complete and expire items. In development builds on developer
machines, this usually takes less than a second (our scanning period is
set to half a second). However, in debug builds on Jenkins each scan
often takes up to 100 (!) seconds (this is the record we've seen so far).
This is why we set the tests' timeout to 120.
But recently we saw another test run failing. I think the problem is
that in some case, we need not one, but *two* scanning passes to
complete before the timeout: It is possible that the test writes an
item right after the current scan passed it, so it doesn't get expired,
and then we a second scan at a random position, possibly making that
item we mention one of the last items to be considered - so in total
we need to wait for two scanning periods, not one, for the item to
expire.
So this patch increases the timeout from 120 seconds to 240 seconds -
more than twice the highest scanning time we ever saw (100 seconds).
Note that this timeout is just a timeout, it's not the typical test
run time: The test can finish much more quickly, as little as one
second, if items expire quickly on a fast build and machine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12106
Fix some issues found with gcc 12. Note we can't fully compile with gcc yet, due to [1].
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056Closes#12121
* github.com:scylladb/scylladb:
utils: observer: qualify seastar::noncopyable_function
sstables: generation_type: forgo constexpr on hash of generation_type
logalloc: disambiguate types and non-type members
task_manager: disambiguate types and non-type members
direct_failure_detector: don't change meaning of endpoint_liveness
schema: abort on illegal per column computation kind
database: abort on illegal per partition rate limit operation
mutation_fragment: abort on illegal fragment type
per_partition_rate_limit_options: abort on illegal operation type
schema: drop unused lambda
mutation_partition: drop unused lambda
cql3: create_index_statement: remove unused lambda
transport: prevent signed and unsigned comparison
database: don't compare signed and unsigned types
raft: don't compare signed and unsigned types
compaction: don't compare signed and unsigned compaction counts
bytes_ostream: don't take reference to packed variable
Test identifiers are very unique, but this makes them less
useful in Jenkins Test Result Analyzer view. For example,
counter_test can be counter_test.432 in one run and counter_test.442
in another. Jenkins considers them different and so we don't see
a trend.
Limit the id uniqueness within a test case, so that we'll have
counter_test.{1, 2, 3} consistently. Those test will be grouped
together so we can see pass/fail trends.
Closes#11946
Fix https://github.com/scylladb/scylla-docs/issues/4126Closes#11122
* github.com:scylladb/scylladb:
doc: add info about the time-consuming step due to resharding
doc: add the new KB to the toctree
doc: doc: add a KB about updating the mode in perftune.yaml after upgrade
Our `null` expression, after the prepare stage, is redundant with a
`constant` expression containing the value NULL.
Remove it. Its role in the unprepared stage is taken over by
untyped_constant, which gains a new type_class enumeration to
represent it.
Some subtleties:
- Usually, handling of null and untyped_constant, or null and constant
was the same, so they are just folded into each other
- LWT "like" operator now has to discriminate between a literal
string and a literal NULL
- prepare and test_assignment were folded into the corresponing
untyped_constant functions. Some care had to be taken to preserve
error messages.
Closes#12118
Currently, each data sync repair task is started (and hence run) twice.
Thus, when two running operations happen within a time frame long
enough, the following situation may occur:
- the first run finishes
- after some time (ttl) the task is unregistered from the task manager
- the second run finishes and attempts to finish the task which does
not exist anymore
- memory access causes a segfault.
The second call to start is deleted. A check is added
to the start method to ensure that each task is started at most once.
Fixes: #12089Closes#12090
In ad3d2ee47d, we replaced `bool` as an expression element
(representing a boolean constant) with `constant`. But a comment
and a concept continue to mention it.
Remove the comment and the concept fragment.
Closes#12119
Recent changes in topology restricted the get_dc/get_rack calls. Older
code was trying to locate the endpoint in gossiper, then in system
keyspace cache and if the endpoint was not found in both -- returned
"default" location.
New code generates internal error in this case. This approach already
helped to spot several BUGs in code that had been eventually fixed, but
echoes of that change still pop up.
This patch relaxes the "missing endpoint" case by printing a warning in
logs and returning back the "default" location like old code did.
tests: update_cluster_layout_tests.py::*
hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_rebalance
bootstrap_test.py::TestBootstrap::test_decommissioned_wiped_node_can_join
bootstrap_test.py::TestBootstrap::test_failed_bootstap_wiped_node_can_join
materialized_views_test.py::TestMaterializedViews::test_decommission_node_during_mv_insert_4_nodes
refs: #11900
refs: #12054fixes: #11870
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12067
std::hash isn't constexpr, so gcc refuses to make hash of generation_type
constexpr. It's pointless anyway since we never have a compile-time
sstable generation.
logalloc::tracker has some members with the same names as types from
namespace scope. gcc (rightfully) complains that this changes
the meaning of the name. Qualify the types to disambiguate.
task_manager has some members with the same names as types from
namespace scope. gcc (rightfully) complains that this changes
the meaning of the name. Qualify the types to disambiguate.
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
bytes_ostream is packed, so its _begin member is packed as well.
gcc (correctly) disallows taking a reference to an unaligned variable
in an aligned refernce, and complains.
Make it happy by open-coding the exchange operation.
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.
If we want to reuse the IP address, we don't allocate one using the host
registry. This required certain refactors: moving the code responsible
for allocation of IPs outside `ScyllaServer`, into `ScyllaCluster`.
Add two tests, but they are now skipped: one of them is failing (unability
for new node to join group 0) and both suffer from a hardcoded 60-second sleep
in Scylla.
Closes#12032
* github.com:scylladb/scylladb:
test/topology: simple node replace tests (currently disabled)
test/pylib: scylla_cluster: support node replace operation
test/pylib: scylla_cluster: move members initialization to constructor
test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer
test/pylib: scylla_cluster: refactor create_server parameters to a struct
test.py: stop/uninstall clusters instead of servers when cleaning up
test/pylib: artifact_registry: replace `Awaitable` type with `Coroutine`
test.py: prepare for adding extra config from test when creating servers
test/pylib: manager_client: convert `add_server` to use `put_json`
test/pylib: rest_client: allow returning JSON data from `put_json`
test/pylib: scylla_cluster: don't import from manager_client
This reverts commit e2fe8559ca. I
ran all the release mode tests on aarch64 with it reverted, and
it passes. So it looks like whatever problems we had with it
were fixed.
Closes#12072
As a part of CQL rewrite we want to be able to perform filtering by calling `evaluate()` on an expression and checking if it evaluates to `true`. Currently trying to do that for a binary operator would result in an error.
Right now checking if a binary operation like `col1 = 123` is true is done using `is_satisfied_by`, which is able to check if a binary operation evaluates to true for a small set of predefined cases.
Eventually once the grammar is relaxed we will be able to write expressions like: `(col1 < col2) = (1 > ?)`, which doesn't fit with what `is_satisfied_by` is supposed to do.
Additionally expressions like `1 = NULL` should evaluate to `NULL`, not `true` or `false`. `is_satsified_by` is not able to express that properly.
The proper way to go is implementing `evaluate(binary_operator)`, which takes a binary operation and returns what the result of it would be.
Implementing `prepare_expression` for `binary_operator` requires us to be able to evaluate it first. In the next PR I will add support for `prepare_expression`.
Closes#12052
* github.com:scylladb/scylladb:
cql-pytest: enable two unset value tests that pass now
cql-pytest: reduce unset value error message
cql3: expr: change unset value error messages to lowercase
cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
cql3: expr: remove needless braces around switch cases
cql3: move evaluation IS_NOT NULL to a separate function
expr_test: test evaluating LIKE binary_operator
expr_test: test evaluating IS_NOT binary_operator
expr_test: test evaluating CONTAINS_KEY binary_operator
expr_test: test evaluating CONTAINS binary_operator
expr_test: test evaluating IN binary_operator
expr_test: test evaluating GTE binary_operator
expr_test: test evaluating GT binary_operator
expr_test: test evaluating LTE binary_operator
expr_test: test evaluating LT binary_operator
expr_test: test evaluating NEQ binary_operator
expr_test: test evaluating EQ binary_operator
cql3: expr properly handle null in is_one_of()
cql3: expr properly handle null in like()
cql3: expr properly handle null in contains_key()
cql3: expr properly handle null in contains()
cql3: expr: properly handle null in limits()
cql3: expr: remove unneeded overload of limits()
cql3: expr: properly handle null in equality operators
cql3: expr: remove unneeded overload of equal()
cql3: expr: use evaluate(binary_operator) in is_satisfied_by
cql3: expr: handle IS NOT NULL when evaluating binary_operator
cql3: expr: make it possible to evaluate binary_operator
cql3: expr: accept expression as lhs argument to like()
cql3: expr: accept expression as lhs in contains_key
cql3: expr: accept expression as lhs argument to contains()
The Alternator TTL expiration scanner scans an entire table using many
small pages. If any of those pages time out for some reason (e.g., an
overload situation), we currently consider the entire scan to have failed
and wait for the next scan period (which by default is 24 hours) when
we start the scan from scratch (at a random position). There is a risk
that if these timeouts are common enough to occur once or more per
scan, the result is that we double or more the effective expiration lag.
A better solution, done in this patch, is to retry from the same position
if a single page timed out - immediately (or almost immediately, we add
a one-second sleep).
Fixes#11737
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12092
scylla-driver causes dtests to fail randomly (likely
due to incorrect handling of the USE statement). Revert
it.
* tools/java 73422ee114...1c06006447 (2):
> Revert "Add Scylla Cloud serverless support"
> Revert "Switch cqlsh to use scylla-driver"
update_normal_tokens checks that that the endpoint is in topology.
Currently we call update_topology on this path only if it's
not a normal_token_owner, but there are paths when the
endpoint could be a normal token owner but still
be pending in topology so always update it, just in case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
asias@scylladb.com said:
> This comments was moved up to the wrong place when tmptr->update_topology was added.
> There is no race now since we use the copy-update-replace method to update token_metadada.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recently, clang started complaining about std::unexpected_handler being
deprecated:
```
In file included from utils/exceptions.cc:18:
./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations]
std::unexpected_handler unexpectedHandler;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here
typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) ();
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED'
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED'
^
1 warning generated.
```
According to cppreference.com, it was deprecated in C++11 and removed in
C++17 (!).
This commit gets rid of the warning by inlining the
std::unexpected_handler typedef, which is defined as a pointer a
function with 0 arguments, returning void.
Fixes: #12022Closes#12074
On rare occassions a SELECT on a DROPpped table throws
cassandra.ReadFailure instead of cassandra.InvalidRequest. This could
not be reproduced locally.
Catch both exceptions as the table is not present anyway and it's
correctly marked as a failure.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#12027
Global index page caching, as introduced in 4.6
(078a6e422b and 9f957f1cf9) has proven to be misdesigned,
because it poses a risk of catastrophic performance regressions in
common workloads by flooding the cache with useless index entries.
Because of that risk, it should be disabled by default.
Refs #11202Fixes#11889Closes#11890
As a preparation to replacing repair_info with shard_repair_task_impl,
type of _repairs in repair module is changed from
std::unordered_map<int, lw_shared_ptr<repair_info>> to
std::unordered_map<int, tasks::task_id>.
As a part of replacing repair_info with shard_repair_task_impl,
instead of a reference to repair_info, row_level_repair keeps
a reference to shard_repair_task_impl.
Function do_repair_ranges is directly connected to shard repair tasks.
Turning it into shard_repair_task_impl method enables an access to tasks'
members with no additional intermediate layers.
Methods of repair_info are copied to shard_repair_task_impl. They are
not used yet, it's a preparation for replacing repair_info with
shard_repair_task_impl.
In system_keyspace::get_repair_history value of repair_uuid
is got from row as tasks::task_id.
tasks::task_id is represented by an abstract_type specific
for utils::UUID. Thus, since their typeids differ, bad_cast
is thrown.
repair_uuid is got from row as utils::UUID and then cast.
Since no longer needed, data_type_for<tasks::task_id> is deleted.
Fixes: #11966Closes#12062
While implementing evaluate(binary_operator)
missing checks for unset value were added
for comparisons in filtering code.
Because of that some tests for unset value
started passing.
There are still other tests for unset value
that are failing because Scylla doesn't
have all the checks that it should.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When unset value appears in an invalid place
both Cassandra and Scylla throw an error.
The tests were written with Cassandra
and thus the expected error messages were
exactly the same as produced by Cassandra.
Scylla produces different error messages,
but both databases return messages with
the text 'unset value'.
Reduce the expected message text
from the whole message to something
that contains 'unset value'.
It would be hard to mimic Cassandra's
error messages in Scylla. There is no
point in spending time on that.
Instead it's better to modify the tests
so that they are able to work with
both Cassandra and Scylla.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The messages used to contain UNSET_VALUE
in capital letters, but the tests
expect messages with 'unset value'.
Change the message so that it can
match the expected error text in tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add two node replace tests using the freshly added infrastructure.
One test replaces a node while using a different IP. It is disabled
because the replace operation has an unconditional 60-seconds sleep
(it doesn't depend on the ring_delay setting for some reason). The sleep
needs to be fixed before we can enable this test.
The other test replaces while reusing the replaced node's IP.
Additionally to the sleep, the test fails because the node cannot join
group 0; it's stuck in an infinite loop of trying to join:
```
INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found no local group 0. Discovering...
INFO 2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader b7047f7e-03e6-4797-a723-24054201f91d
INFO 2022-11-18 15:56:19,934 [shard 0] raft_group0 - Server 8de951fd-a528-4a82-ac54-592ea269537f is starting group 0 with id 25d2b050-6751-11ed-b534-c3c40c275dd3
WARN 2022-11-18 15:56:20,935 [shard 0] raft_group0 - failed to modify config at peer b7047f7e-03e6-4797-a723-24054201f91d: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO 2022-11-18 15:56:21,937 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN 2022-11-18 15:56:22,937 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO 2022-11-18 15:56:23,938 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN 2022-11-18 15:56:24,939 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
```
and so on.
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.
If we want to reuse the IP address, we don't allocate one using the host
registry.
Since now multiple servers can have the same IP, introduce a
`leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing
the cluster - to make sure we don't `release_host` the same host twice.
Previously some members had to be initialized in `install` because
that's when we first knew the IP address.
Now we know the IP address during construction, which allows us to make
the code a bit shorter and simpler, and establish invariants: some
members (such as `self.config`) are now valid for the entire lifetime of
the server object.
`install()` is reduced to performing only side effects (creating
directories, writing config files), all calculation is done inside the
constructor.
`ScyllaServer`s were constructed without IP addresses. They leased an IP
address from `HostRegistry` and released them in `uninstall`.
This responsibility was now moved into `ScyllaCluster`, which leases an
IP address for a server before constructing it, and passes it to the
constructor. It releases the addresses of its serverswhen uninstalling
itself.
This will allow the cluster to reuse the IP address of an existing
server in that cluster when adding a new server which wants to replace
the existing one. Instead of leasing a new address, it will pass
the existing IP address to the new server's constructor.
The refactor is also nice in that it establishes an invariant for
`ScyllaServer`, simplifying reasoning about the class: now it has
an `ip_addr` field at all times.
`host_registry` was moved from `ScyllaServer` to `ScyllaCluster`.
`ScyllaCluster` constructor takes a function `create_server` which
itself takes 3 parameters now. Soon it will take a 4th. The list of
parameters is repeated at the constructor definition and the call site
of the constructor, with many parameters it begins being tiresome.
Refactor the list of parameters to a `NamedTuple`.
`self.artifacts` was calling `ScyllaServer.stop` and
`ScyllaServer.uninstall`. Now it calls `ScyllaCluster.stop` and
`ScyllaCluster.uninstall`, which underneath stops/uninstalls
servers in this cluster.
We must be a bit more careful now in case installing/starting a
server inside a cluster fails: there are no server cleanup artifacts,
and a server is added to cluster's `running` map only after
`install_and_start` finishes (until that happens,
`ScyllaCluster.stop/uninstall` won't catch this server).
So handle failures explicitly in `install_and_start`.
This commit does not logically change how the tests are running - every
started server belongs to some cluster, so it will be cleaned up
- but it's an important refactor.
It will allow us to move IP address (de)allocation code outside
`ScyllaServer`, into `ScyllaCluster`, which in turn will allow us to
implement node replace operation for the case where we want to reuse
the replaced node's IP.
Also, `ScyllaCluster.uninstall` was unused before this change, now it's
used.
Currently, TTL is listed as one of the experimental features: https://docs.scylladb.com/stable/alternator/compatibility.html#experimental-api-features
This PR moves the feature description from the Experimental Features section to a separate section.
I've also added some links and improved the formatting.
@tzach I've relied on your release notes for RC1.
Refs: https://github.com/scylladb/scylladb/issues/5060Closes#11997
* github.com:scylladb/scylladb:
Update docs/alternator/compatibility.md
doc: update the link to Enabling Experimental Features
doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph
doc: update the links to the Enabling Experimental Features section
doc: add the link to the Enabling Experimental Features section
doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.
This patch removes this requirement and in essence GAs this feature.
Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.
This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.
Fixes#12037
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12049
The `cleanup_before_exit` method of `ArtifactRegistry` calls `close()`
on artifacts. mypy complains that `Awaitable` has no such method. In
fact, the `artifact` objects that we pass to `ArtifactRegistry`
(obtained by calling `async def` functions) do have a `close()` method,
and they are a particular case of `Awaitable`s, but in general not
all `Awaitable`s have `close()`.
Replace `Awaitable` with one of its subtypes: `Coroutine`. `Coroutine`s
have a `close()` method, and `async def` functions return objects of
this type. mypy no longer complains.
[PR](https://github.com/scylladb/scylladb/pull/9314) fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour of
`modification_statement::create_clustering_ranges` to return an
empty range in this case, since `possible_lhs_values` it
uses explicitly returns `empty_value_set` if it evaluates `rhs`
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
`modification_statement::process_where_clause`. So the only
problem was `modification_statement::execute_with_condition`
was not expecting an empty `clustering_range` in case of
a null clustering key.
Also this patch contains a fix for the problem with wrong
column name in Scylla error messages. If `INSERT` or `DELETE`
statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.
The problem occurs if the query contains a column with the list type,
otherwise
`statement_restrictions::process_clustering_columns_restrictions`
checks that all the components of the key are specified.
Closes#12047
* github.com:scylladb/scylladb:
cql: refactor, inline modification_statement::validate_primary_key_restrictions
cql: DELETE with null value for IN parameter should be forbidden
cql: add column name to the error message in case of null primary key component
cql: batch statement, inserting a row with a null key column should be forbidden
cql: wrong column name in error messages
modification_statement: fix LWT insert crash if clustering key is null
Release 5.1. introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE should include links to that page and section.
Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.
Related: https://github.com/scylladb/scylladb/pull/9810Closes#12070
* github.com:scylladb/scylladb:
doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list
doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections
Don't call get_datacenter(ep) without checking
first has_endpoint(ep) since the former may abort
on internal error if the endpoint is not listed
in topology.
Refs #11870
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12054
If a DELETE statement contains an IN operator and the
parameter value for it is NULL, this should also trigger
an error. This is in line with how Cassandra
behaves in this case.
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.
Fixes: #12060
If INSERT or DELETE statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.
The problem occurs if the query contains a column with the list type,
otherwise
statement_restrictions::process_clustering_columns_restrictions
checks that all the components of the key are specified.
Fixes: #12046
Returns an unordered set of datacenter names
to be used by network_topology_replication_strategy
and for ks_prop_defs.
The set is kept in sync with _dc_endpoints.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12023
Since we moved all IaaS code to scylla-machine-image, we nolonger need
AMI variable on sysconfig file or --ami parameter on setup scripts,
and also never used /etc/scylla/ami_disabled.
So let's drop all of them from Scylla core core.
Related with scylladb/scylla-machine-image#61
Closes#12043
When started the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir method works. It's shorter and simpler to pass these booleans into method directly, all the more so there's another flag that's already passed like this.
Closes#12005
* github.com:scylladb/scylladb:
sstable_directory: Move all RAII booleans onto flags
sstable_directory: Convert sort-sstables argument to flags struct
sstable_directory: Drop default filter
Scylla doesn't support combining restrictions
on token with other restrictions on partition key columns.
Some pieces of code depend on the assumption
that such combinations are allowed.
In case they were allowed in the future
these functions would silently start
returning wrong results, and we would
return invalid rows.
Add a test that will start failing once
this restriction is removed. It will
warn the developer to change the
functions that used to depend
on the assumption.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The PR introduces top level repair tasks representing repair and node operations
performed with repair. The actions performed as a part of these operations are
moved to corresponding tasks' run methods.
Also a small change to repair module is added.
Closes#11869
* github.com:scylladb/scylladb:
repair: define run for data_sync_repair_task_impl
repair: add data_sync_repair_task_impl
tasks: repair: add noexcept to task impl constructor
repair: define run for user_requested_repair_task_impl
repair: add user_requested_repair_task_impl
repair: allow direct access to max_repair_memory_per_range
Originally put braces around the cases because
there were local variables that I didn't want
to be shadowed.
Now there are no variables so the braces
can be removed without any problems.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When evaluating a binary operation with
operations like EQUAL, LESS_THAN, IN
the logic of the operation is put
in a separate function to keep things clean.
IS_NOT NULL is the only exception,
it has its evaluate implementation
right in the evaluate(binary_operator)
function.
It would be cleaner to have it in
a separate dedicated function,
so it's moved to one.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There is a more general version of limits()
which takes expressions as both the lhs and rhs
arguments.
There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.
Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benfit.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Expressions like:
123 = NULL
NULL = 123
NULL = NULL
NULL != 123
should be tolerated, but evaluate to NULL.
The current code assumes that a binary operator
can only evaluate to a boolean - true or false.
Now a binary operator can also evaluate to NULL.
This should happen in cases when one of the
operator's sides is NULL.
A special class is introduced to represent a value
that can be one of three things: true, false or null.
It's better than using std::optional<bool>,
because optional has implicit conversions to bool
that could cause confusion and bugs.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When processing a query, we keep a pointer to an effective_replication_map.
In a couple places we used the latest topology instead of the one held by the effective_replication_map
that the query uses and that might lead to inconsistencies if, for example, a node is removed from topology after decommission that happens concurrently to the query.
This change gets the topology& from the e_r_m in those cases.
Fixes#12050Closes#12051
* github.com:scylladb/scylladb:
storage_proxy: pass topology& to sort_endpoints_by_proximity
storage_proxy: pass topology& to is_worth_merging_for_range_query
Currently the ctor of said class always allocates as it copies the
provided name string and it creates a new name via format().
We want to avoid this, now that the validator is used on the read path.
So defer creating the formatted name to when we actually want to log
something, which is either when log level is debug or when an error is
found. We don't care about performance in either case, but we do care
about it on the happy path.
Further to the above, provide a constructor for string literal names and
when this is used, don't copy the name string, just save a view to it.
Refs: #11174Closes#12042
Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages).
Fixes [ #11800](https://github.com/scylladb/scylladb/issues/11800)
Closes#11926
* github.com:scylladb/scylladb:
alternator: add maybe_quote to secondary indexes 'where' condition
test/alternator: correct xfail reason for test_gsi_backfill_empty_string
test/alternator: correct indentation in test_lsi_describe
alternator: fix wrong 'where' condition for GSI range key
There's a bunch of booleans that control the behavior of sstable
directory scanning. Currently they are described as verbose
bool_class<>-es and are put into sstable_directory construction time.
However, these are not used outside of .process_sstable_dir() method and
moving them onto recently added flags struct makes the code much
shorter (29 insertions(+), 121 deletions(-))
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable_directory::process_sstable_dir() accepts a boolean to
control its behavior when collecting sstables. Turn this boolean into a
structure of flags. The intention is to extend this flags set in the
future (next patch).
This boolean is true all the time, but one place sets it to true in a
"verbose" manner, like this:
bool sort_sstables_according_to_owner = false;
process_sstable_dir(directory, sort_sstables_according_to_owner).get();
the local variable is not used anymore. Using designated initializers
solves the verbosity in a nicer manner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's used as default argument for .reshape() method, but callers specify
it explicitly. At the same time the filter is simple enough and is only
used in one place so that the caller can just use explicit lambda.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There is a more general version of equal()
which takes expressions as both the lhs and rhs
arguments.
There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.
Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benfit.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.
It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.
Fixes: #11954
This bug doesn't affect anything, the reason is descibed in the commit:
'alternator: fix wrong 'where' condition for GSI range key'.
But it's theoretically correct to escape those key names and
the difference can be observed via CQL's describe table. Before
the patch 'where' condition is missing one double quote in variable
name making it mismatched with corresponding column name.
Otherwise I think assert is not executed in a loop. And I am not sure why lsi variable can be bound
to anything. As I tested it was pointing to the last element in lsis...
This bug doesn't manifest in a visible way to the user.
Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported
so we don't need to consider what could happen for empty values of index range key.
After the index is added the only interesting value user can set is omitting
the value (null or empty are not allowed, see test_gsi_empty_value and
test_gsi_null_value).
In practice no matter of 'where' condition the underlaying materialized
view code is skipping row updates with missing keys as per this comment:
'If one of the key columns is missing, set has_new_row = false
meaning that after the update there will be no view row'.
Thats why the added test passes both before and after the patch.
But it's still usefull to include it to exercise those code paths.
Fixes#11800
This patch includes a translation of several additional small test files
from Cassandra's CQL unit test directory cql3/validation/operations.
All tests included here pass on both Cassandra and Scylla, so they did
not discover any new Scylla bugs, but can be useful in the future as
regression tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12045
Take advantage of the facts that both the owned ranges
and the initial non_owned_ranges (derived from the set of sstables)
are deoverlapped and sorted by start token to turn
the calculation of the final non_owned_ranges from
quadratic to linear.
Fixes#11922Closes#11903
* github.com:scylladb/scylladb:
dht: optimize subtract_ranges
compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation
compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys
is_satisfied_by has to check if a binary_operator is satisfied
by some values. It used to be impossible to evaluate
a binary_operator, so is_satisfied had code to check
if its satisfied for a limited number of cases
occuring when filtering queries.
Now evaluate(binary_operator) has been implemented
and is_satisfied_by can use it to check if a binary_operator
evaluates to true.
This is cleaner and reduces code duplication.
Additionally cql tests will test the new evalute() implementation.
There is one special case with token().
When is_satisfied_by sees a restriction on token
it assumes that it's satisfied because it's
sure that these token restrictions were used
to generate partition ranges.
I had to leave this special case in because it's impossible
to evaluate(token). Once this is implemented I will remove
the special case because it's risky and prone to cause
bugs.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The code to evaluate binary operators
was copied from is_satisfied_by.
is_satisfied_by wasn't able to evaluate
IS NOT NULL restrictions, so when such restriction
is encountered it throws an exception.
Implement proper handling for IS NOT NULL binary operators.
The switch ensures that all variants of oper_t are handled,
otherwise there would be a compilation error.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In order to support different storage kinds for sstable files (e.g. -- s3) it's needed to localize all the places that manipulate files on a POSIX filesystem so that custom storage could implement them in its own way. This set moves the deletion log manipulations to the sstable_directory.cc, which already "knows" that it works over a directory.
Closes#12020
* github.com:scylladb/scylladb:
sstables: Delete log file in replay_pending_delete_log()
sstables: Move deletion log manipulations to sstable_directory.cc
sstables: Open-code delete_sstables() call
sstables: Use fs::path in replay_pending_delete_log()
sstables: Indentation fix after previous patch
sstables: Coroutinize replay_pending_delete_log
sstables: Read pending delete log with one line helper
sstables: Dont write pending log with file_writer
evaluate() takes an expression and evaluates it
to a constant value. It wasn't possible to evalute
binary operators before, so it's added.
The code is based on is_satisfied_by,
which is currently used to check
whether a binary operator evaluates
to true or false.
It looks like is_satisfied_by and evalate()
do pretty much the same thing, one could be
implemented using the other.
In the future they might get merged
into a single function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
like() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
contains_key() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
contains() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Take advantage of the fact that both ranges and
ranges_to_subtract are deoverlapped and sorted by
to reduce the calculation complexity from
quadratic to linear.
Fixes#11922
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The algorithm is generic and can be used elsewhere.
Add a unit test for the function before it gets
optimized in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the function is inefficient in two ways:
1. unnecessary copy of first/last keys to automatic variables
2. redecorating the partition keys with the schema passed to
needs_cleanup.
We canjust use the tokens from the sstable first/last decorated keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The deletion log concept uses the fact that files are on a POSIX
filesystem. Support for another storage type will have to reimplement
this place, so keep the FS-specific code in _directory.cc file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no used by any other code, but to be used it requires the caller to
tranform TOC file names by prepending sstable directory to them. Things
get shorter and simpler if merging the helper code into the caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's called by a code that has fs::path at hand and internally uses
helpers that need fs::path too, so no need to convert it back and forth.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a wrapper over output_stream with offset tracking and the tracking
is not needed to generate a log file. As a bonus of switching back we
get a stream.write(sstring) sugar.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix https://github.com/scylladb/scylladb/issues/11598
This PR adds the troubleshooting article submitted by @syuu1228 in the deprecated _scylla-docs_ repo, with https://github.com/scylladb/scylla-docs/pull/4152.
I copied and reorganized the content and rewritten it a little according to the RST guidelines so that the page renders correctly.
@syuu1228 Could you review this PR to make sure that my changes didn't distort the original meaning?
Closes#11626
* github.com:scylladb/scylladb:
doc: apply the feedback to improve clarity
doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB
doc: add the new page to the toctree
doc: add a troubleshooting article about the missing configuration files
There's a logical dependency from `manager_client` to `scylla_cluster`
(`ManagerClient` defined in `manager_client` talks to
`ScyllaClusterManager` defined in `scylla_cluster` over RPC). There is
no such dependency in the other way. Do not introduce it accidentally.
We can import these types from the `internal_types` module.
We had an xfailing test that reproduced a case where Alternator tried
to report an error when the request was too long, but the boto library
didn't see this error and threw a "Broken Pipe" error instead. It turns
out that this wasn't a Scylla bug but rather a bug in urllib3, which
overzealously reported a "Broken Pipe" instead of trying to read the
server's response. It turns out this issue was already fixed in
https://github.com/urllib3/urllib3/pull/1524
and now, on modern installations, the test that used to fail now passes
and reports "XPASS".
So in this patch we remove the "xfail" tag, and skip the test if
running an old version of urllib3.
Fixes#8195Closes#12038
Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written.
This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too.
This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign.
Fixes: https://github.com/scylladb/scylladb/issues/11174Closes#11532
* github.com:scylladb/scylladb:
mutation_compactor: add validator
mutation_fragment_stream_validator: add a 'none' validation level
test/boost/mutation_query_test: test_partition_limit: sort input data
querier: consume_page(): use partition_start as the sentinel value
treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
position_in_partition: add for_partition_{start,end}()
Adds unit tests for the function `expr::prepare_expression`.
Three minor bugs were found by these tests, both fixed in this PR.
1. When preparing a map, the type for tuple constructor was taken from an unprepared tuple, which has `nullptr` as its type.
2. Preparing an empty nonfrozen list or set resulted in `null`, but preparing a map didn't. Fixed this inconsistency.
3. Preparing a `bind_variable` with `nullptr` receiver was allowed. The `bind_variable` ended up with a `nullptr` type, which is incorrect. Changed it to throw an exception,
Closes#11941
* github.com:scylladb/scylladb:
test preparing expr::usertype_constructor
expr_test: test that prepare_expression checks style_type of collection_constructor
expr_test: test preparing expr::collection_constructor for map
prepare_expr: make preparing nonfrozen empty maps return null
prepare_expr: fix a bug in map_prepare_expression
expr_test: test preparing expr::collection_constructor for set
expr_test: test preparing expr::collection_constructor for list
expr_test: test preparing expr::tuple_constructor
expr_test: test preparing expr::untyped_constant
expr_test_utils: add make_bigint_raw/const
expr_test_utils: add make_tinyint_raw/const
expr_test: test preparing expr::bind_variable
cql3: prepare_expr: forbid preparing bind_variable without a receiver
expr_test: test preparing expr::null
expr_test: test preparing expr::cast
expr_test_utils: add make_receiver
expr_test_utils: add make_smallint_raw/const
expr_test: test preparing expr::token
expr_test: test preparing expr::subscript
expr_test: test preparing expr::column_value
expr_test: test preparing expr::unresolved_identifier
expr_test_utils: mock data_dictionary::database
We had a test that used to fail because of issue #8745. But this issue
was alread fixed, and we forgot to remove the "xfail" marker. The test
now passes, so let's remove the xfail marker.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12039
Add new guide for upgrading 5.1 to 5.2.
In this new upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.
In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
situation.
Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.
Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
once they exist
Closes#11910
* github.com:scylladb/scylladb:
docs: describe the Raft upgrade and recovery procedures
docs: add upgrade guide 5.1 -> 5.2
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism however had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added, without the fiber acting on it. Since it still held onto the lock, it also prevented from other eviction fibers to start. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criteria, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released in the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds a detection to the inactive/waiter contradiction.
Fixes: #11923
Refs: #11770Closes#12026
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
This series contains a mixed bag of improvements to `scylla sstable dump-data`. These improvements are mostly aimed at making the json output clearer, getting rid of any ambiguities.
Closes#12030
* github.com:scylladb/scylladb:
tools/scylla-sstable: traverse sstables in argument order
tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements
tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>"
tools/scylla-sstable: dump-data/json: use more uniform format for collections
tools/scylla-sstable: dump-data/json: make cells easier to parse
Since recently the framework uses a separate set of unique IDs to
identify servers, but the log file and workdir is still named using the
last part of the IP address.
This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.
Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).
So use the ID instead to name the workdir and log file.
Also, when starting a test case, print the used cluster. This will make
it easier to map server IDs to their IP addresses when browsing through
the test logs.
Closes#12018
* github.com:scylladb/scylladb:
test/pylib: manager_client: print used cluster when starting test case
test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address
BOOST_CHECK_EQUAL is a weaker form of assertion, it reports an error
and will cause the test case to fail but continues. This makes the
test harder to debug because there's no obvious way to catch the
failure in GDB and the test output is also flooded with things which
happen after the failed assertion.
Message-Id: <20221119171855.2240225-1-tgrabiec@scylladb.com>
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.
Fixes: #6200Fixes: #12014Closes#12031
* github.com:scylladb/scylladb:
cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
boost/restrictions-test: uncomment part of the test that passes now
cql-pytest: enable test for filtering combined multi column and regular column restrictions
cql3: don't ignore other restrictions when a multi column restriction is present during filtering
The table UUIDs are the same on all shards
so we might as well get them on shard 0
(as we already do) and reuse them on other shards.
It is more efficient and accurate to lookup the table
eventually on the shard using its uuid rather than
its name. If the table was dropped and recreated
using the same name in the background, the new
table will have a new uuid and do the api function
does not apply to it anymore.
A following change will handle the no_such_column_family
cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In issue #12014 a user has encountered an instance of #6200.
When filtering a WHERE clause which contained
both multi-column and regular restrictions,
the regular restrictions were ignored.
Add a test which reproduces the issue
using a reproducer provided by the user.
This problem is tested in another similar test,
but this one reproduces the issue in the exact
way it was found by the user.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A part of the test was commented out due to #6200.
Now #6200 has been fixed and it can be uncommented.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The test test_multi_column_restrictions_and_filtering was marked as xfail,
because issue #6200 wasn't fixed. Now that filtering
multi column and other restrictions together has been fixed
the test passes.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.
This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)
When multi column restrictions were detected,
the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions
are not satisfied. When they are satisfied the other
restrictions are checked as well to ensure all
of them are satisfied.
This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column
and regular columns and this approach was correct.
Fixes: #6200Fixes: #12014
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The currently used "<unknown>" marker for invalid values/types is
undistinguishable from a normal value in some cases. Use the much more
distinct and unique json Null instead.
Instead of trying to be clever and switching the output on the type of
collection, use the same format always: a list of objects, where the
object has a key and value attribute, containing to the respective
collection item key and values. This makes processing much easier for
machines (and humans too since the previous system wasn't working well).
There are several slightly different cell types in scylla: regular
cells, collection cells (frozen and non-frozen) and counter cells
(update and shards). In C++ code the type of the cell is always
available for code wishing to make out exactly what kind of cell a cell
is. In the JSON output of the dump-data this is currently really hard to
do as there is not enough information to disambiguate all the different
cell types. We wish to make the JSON output self-sufficient so in this
patch we introduce a "type" field which contains one of:
* regular
* counter-update
* counter-shards
* frozen-collection
* collection
Furthermore, we bring the different types closer by also printing the
counter shards under the 'value' key, not under the 'shards' key as
before. The separate 'shards' is no longer needed to disambiguate.
The documentation and the write operation is also updated to reflect the
changes.
Some tests may take longer than a few seconds to run. We want to
mark such tests in some way, so that we can run them selectively.
This patch proposes to use pytest markers for this. The markers
from the test.py command line are passed to pytest
as is via the -m parameter.
By default, the marker filter is not applied and all tests
will be run without exception. To exclude e.g. slow tests
you can write --markers 'not slow'.
The --markers parameter is currently only supported
by Python tests, other tests ignore it. We intend to
support this parameter for other types of tests in the future.
Another possible improvement is not to run suites for which
all tests have been filtered out by markers. The markers are
currently handled by pytest, which means that the logic in
test.py (e.g., running a scylla test cluster) will be run
for such suites.
Closes#11713
This reverts commit 22f13e7ca3, and reinstates
commit df8e1da8b2 ("Merge 'cql3: select_statement:
coroutinize indexed_table_select_statement::do_execute_base_query()' from
Avi Kivity"). The original commit was reverted due to failures in debug
mode on aarch64, but after commit 224a2877b9
("build: disable -Og in debug mode to avoid coroutine asan breakage"), it
works again.
Closes#12021
We plan to stop storing IP addresses in Raft configuration, and instead
use the information disseminated through gossip to locate Raft peers.
Implement patches that are building up to that:
* improve Raft API of configuration change notifications
* disseminate raft host id in Gossip
* avoid using Raft addresses from Raft configuraiton, and instead
consistently use the translation layer between raft server id <-> IP
address
Closes#11953
* github.com:scylladb/scylladb:
raft: persist the initial raft address map
raft: (upgrade) do not use IP addresses from Raft config
raft: (and gossip) begin gossiping raft server ids
raft: change the API of conf change notifications
The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent.
Closes#12017
* github.com:scylladb/scylladb:
lister: Make lister::dir_entry_types an enum_set
database: Avoid useless local variable
The semaphore should admit readers as soon as it can. So at any point in
time there should be either no waiters, or the semaphore shouldn't be
able to admit new reads. Otherwise something went wrong. Detect this
when queuing up reads and dump the diagnostics if detected.
Even though tests should ensure this should never happen, recently we've
seen a race between eviction and enqueuing producing such situations.
This is very hard to write tests for, so add built-in detection and
protection instead. Detecting this is very cheap anyway.
Said method has a protection against concurrent (recursive more like)
calls to itself, by setting a flag `_evicting` and returning early if
this flag is set. The evicting loop however has at least one preemption
point between deciding there is nothing more to evict and resetting said
flag. This window provides opporunity for new inactive reads or waiters
to be queued without this loop noticing, while denying any other
concurrent invocations at that time from reacting too.
Eliminate this by using repeat() instead of do_until() and setting
`_evicting = false` the moment the loop's run condition becomes false.
Currently this method detaches the inactive read from the handle and
notifies the permit, calls the notify handler if any and does some stat
bookkeeping. Extend it to do a complete detach: unlink the entry from
the inactive reads list and also cancel the ttl timer.
After this, all that is left to the caller is to destroy the entry.
This will prevent any recursive eviction from causing assertion failure.
Although recursive eviction shouldn't happen, it shouldn't trigger an
assert.
Since commit a980f94 (token_metadata: impl: keep the set of normal token owners as a member), we have a set, _normal_token_owners, which contains all the nodes in the ring.
We can use _normal_token_owners to check if a node is part of the ring directly instead of going through the _toplogy indirectly.
Fixes#11935Closes#11936
* github.com:scylladb/scylladb:
token_metadata: Rename is_member to is_normal_token_owner
token_metadata: Add docs for is_member
token_metadata: Do not use topology info for is_member check
token_metadata: Check node is part of the topology instead of the ring
Since commit a980f94 (token_metadata: impl: keep the set of normal token
owners as a member), we have a set, _normal_token_owners, which contains
all the nodes in the ring.
We can use _normal_token_owners to check if a node is part of the ring
directly instead of going through the _toplogy indirectly.
Fixes#11935
update_normal_tokens is the way to add a new node into the ring. We
should not require a new node to already be in the ring to be able to
add it to the ring. The current code works accidentally because
is_member is checking if a node is in the topology
We should use _topology.has_endpoint to check if a node is part of the
topology explicitly.
In Scylla and Cassandra inserting an empty collection
that is not frozen, is interpreted as inserting a null value.
list_prepare_expression and set_prepare_expression
have an if which handles this behavior, but there
wasn't one in map_prepare_expression.
As a result preparing empty list or set would result in null,
but preparing an empty map wouldn't. This is inconsistent,
it's better to return null in all cases of empty nonfrozen
collections.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
map_prepare_expression takes a collection_constructor
of unprepared items and prepares it.
Elements of a map collection_constructor are tuples (key and value).
map_prepare_expression creates a prepared collection_constructor
by preparing each tuple and adding it to the result.
During this preparation it needs to set the type of the tuple.
There was a bug here - it took the type from unprepared
tuple_constructor and assigned it to the prepared one.
An unprepared tuple_constructor doesn't have a type
so it ended up assigning nullptr.
Instead of that it should create a tuple_type_impl instance
by looking at the types of map key and values,
and use this tuple_type_impl as the type of the prepared tuples.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
prepare_expression treats receiver as an optional argument,
it can be set to nullptr and the preparation should
still succeed when it's possible to infer the type of an expression.
preparing a bind_variable requires the receiver to be present,
because it doesn't contain any information about the type
of the bound value.
Added a check that the receiver is present.
Allowing to prepare a bind_variable without
the receiver present was a bug.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
task_manager::task::impl contains an abort source which can
be used to check whether it is aborted and an abort method
which aborts the task (request_abort on abort_source) and all
its descendants recursively.
When the start method is called after the task was aborted,
then its state is set to failed and the task does not run.
Fixes: #11995Closes#11996
* github.com:scylladb/scylladb:
tasks: do not run tasks that are aborted
tasks: delete unused variable
tasks: add abort_source to task_manager::task::impl
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780Closes#11942
* github.com:scylladb/scylladb:
message: messaging_service: check for known topology before calling is_same_dc/rack
test: reenable test_topology::test_decommission_node_add_column
test/pylib: util: configurable period in wait_for
message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
message: messaging_service: topology independent connection settings for GOSSIP verbs
It's interesting that prepare_expression
for column identifiers doesn't require a receiver.
I hope this won't break validation in the future.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function which creates a mock instance
of data_dictionary::database.
prepare_expression requires a data_dictionary::database
as an argument, so unit tests for it need something
to pass there. make_data_dictionary_database can
be used to create an instance that is sufficient for tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This type is currently an unordered_set, but only consists of at most
two elements. Making it an enum_set renders it into a size_t variable
and better describes the intention.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's used to run lister::scan_dir() with directory_entry_type::directory
only, but for that is copied around on lambda captures. It's simpler
just to use the value directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since recently the framework uses a separate set of unique IDs to
identify servers, but the log file and workdir is still named using the
last part of the IP address.
This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.
Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).
As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization.
This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation.
While at at, the series includes other cleanups in this area in repair and view that are not fixes as the calls happen in synchronous functions that do not yield.
Fixes#11993Closes#11994
* github.com:scylladb/scylladb:
repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
repair: pass effective_replication_map down to repair_info
repair: coroutinize sync_data_using_repair
repair: futurize do_repair_start
effective_replication_map: add global_effective_replication_map
shared_token_metadata: get_lock is const
repair: sync_data_using_repair: require to run on shard 0
repair: require all node operations to be called on shard 0
repair: repair_info: keep effective_replication_map
repair: do_repair_start: use keyspace erm to get keyspace local ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges
repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
repair: do_repair_start: check_in_shutdown first
repair: get_db().local() where needed
repair: get topology from erm/token_metdata_ptr
view: get_view_natural_endpoint: get topology from erm
Always use raft address map to obtain the IP addresses
of upgrade peers. Right now the map is populated
from Raft configuration, so it's an equivalent transformation,
but in the future raft address map will be populated from other sources:
discovery and gossip, hence the logic of upgrade will change as well.
Do not proceed with the upgrade if an address is
missing from the map, since it means we failed to contact a raft member.
This series moves the topology code from locator/token_metadata.{cc,hh} out to localtor/topology.{cc,hh}
and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987
While at it, the token_metadata functions are turned into coroutines
and topology copy constructor is deleted. The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology.
Closes#12001
* github.com:scylladb/scylladb:
locator: refactor topology out of token_metadata
locator: add types.hh
topology: delete copy constructor
token_metadata: coroutinize clone functions
compaction_manager::task (and thus compaction_data) can be stopped
because of many different reasons. Thus, abort can be requested more
than once on compaction_data abort source causing a crash.
To prevent this before each request_abort() we check whether an abort
was requested before.
Closes#12004
An unordered_set is more efficient and there is no need
to return an ordered set for this purpose.
This change facilitates a follow-up change of adding
topology::get_datacenters(), returning an unordered_set
of datacenter names.
Refs #11987
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12003
Fix a typo introduced by the the recent patch fixing the spelling of
Barrett. The patch introduced a typo in the aarch64 version of the code,
which wasn't found by promotion, as that only builds on X86_64.
Closes#12006
We plan to use gossip data to educate Raft RPC about IP addresses
of raft peers. Add raft server ids to application state, so
that when we get a notification about a gossip peer we can
identify which raft server id this notification is for,
specifically, we can find what IP address stands for this server
id, and, whenever the IP address changes, we can update Raft
address map with the new address.
On the same token, at boot time, we now have to start Gossip
before Raft, since Raft won't be able to send any messages
without gossip data about IP addresses.
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.
For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
And make sure the token_metadata ring version is same as the
reference one (from the erm on shard 0), when starting the
repair on each shard.
Refs #11993
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Turn it into a coroutine to prepare for the next path
that will co_await make_global_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Class to hold a coherent view of a keyspace
effective replication map on all shards.
To be used in a following patch to pass the sharded
keyspace e_r_m:s to repair.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that our toolchain is based on Fedora 37, we can rely on its
libdeflate rather than have to carry our own in a submodule.
Frozen toolchain is regenerated. As a side effect clang is updated
from 15.0.0 to 15.0.4.
Closes#12000
And with that do_sync_data_using_repair can be folded into
sync_data_using_repair.
This will simplify using the effective_replication_map
throughout the operation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling db.get_keyspace_local_ranges that
looks up the keyspace and its erm again.
We want all the inforamtion derived from the erm to
be based on the same source.
The function is synchronous so this changes doesn't
fix anything, just cleans up the code.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Ensure that the primary ranges are in sync with the
keyspace erm.
The function is synchronous so this change doesn't fix anything,
it just cleans up the code.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Ensure the erm and topology are in sync.
The function is synchronous so this change doesn't fix
anything, just cleans up the code.
Fix mistake in comment while at it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In several places we get the sharded database using get_db()
and then we only use db.local(). Simplify the code by keeping
reference only to the local database upfront.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want the topology to be synchronized with the respective
effective_replication_map / token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get the topology for the effective replication map rather
than from the storage_proxy to ensure its synchronized
with the natural endpoints.
Since there's no preemption between the two calls
currently there is no issue, so this is merely a clean up
of the code and not supposed to fix anything.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a reproducer for issue #11954: Attempting an
"IF NOT EXISTS" (LWT) write with a null key crashes Scylla,
instead of producing a simple error message (like happens
without the "IF NOT EXISTS" after #7852 was fixed).
The test passed on Cassandra, but crashes Scylla. Because of this
crash, we can't just mark the test "xfail" and it's temporarily
marked "skip" instead.
Refs #11954.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11982
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in case of
`modify_config`, other `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could led to a tight loop.
The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.
Fixes: #11564Closes#11769
* github.com:scylladb/scylladb:
raft: rafactor: remove duplicate code on retries delays
raft: use wait_for_next_tick in read_barrier
raft: wait for the next tick before retrying
Currently in start() method a task is run even if it was already
aborted.
When start() is called on an aborted task, its state is set to
task_manager::task_state::failed and it doesn't run.
As P. T. Barnoom famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.
Closes#11989
Topology is copied only from token_metadata_impl::clone_only_token_map
which copies the token_metadata_impl with yielding to prevent reactor
stalls. This should apply to topology as well, so
add a clone_gently function for cloning the topology
from token_metadata_impl::clone_only_token_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.
When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).
Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.
Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is not entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.
Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when topology was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.
The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.
Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.
Apparently this fixes#11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.
Fixes: #11780
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).
The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.
However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix#11780.
To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).
This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.
Fixes#11992.
Inspired by xemul/scylla@45d48f3d02.
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwords. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
Fixes#10026Fixes#11542
This PR is V2 of the[ PR created by @psarna.](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.
Fixes https://github.com/scylladb/scylladb/issues/11378Closes#11984
* github.com:scylladb/scylladb:
doc: label user-defined functions as Experimental
doc: restore the note for the Count function (removed by mistatke)
doc: document user defined functions (UDFs)
Despite docs discourage from using INADDR_ANY as listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in.
Closes#11846
* github.com:scylladb/scylladb:
messaging_service: Deny putting INADD_ANY as preferred ip
messaging_service: Toss preferred ip cache management
gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
gossiping_property_file_snitch: Make _listen_address optional
When we translate from docker/go arch names to the kernel arch
names, we use an associative array hack using computed variable
names "{$!variable_name}". But it turns out bash has real
associative arrays, introduced with "declare -A". Use the to make
the code a little clearer.
Closes#11985
Fix https://github.com/scylladb/scylla-docs/issues/4144Closes#11226
* github.com:scylladb/scylladb:
Update docs/getting-started/system-requirements.rst
doc: specify the recommended AWS instance types
doc: replace the tables with a generic description of support for Im4gn and Is4gen instances
doc: add support for AWS i4g instances
doc: extend the list of supported CPUs
As indicated in #11816, we'd like to enable deserializing vectors in reverse.
The forward deserialization is achieved by reading from an input_stream. The
input stream internally is a singly linked list with complicated logic. In order to
allow for going through it in reverse, instead when creating the reverse vector
initializer, we scan the stream and store substreams to all the places that are a
starting point for a next element. The iterator itself just deserializes elements
from the remembered substreams, this time in reverse.
Fixes#11816Closes#11956
* github.com:scylladb/scylladb:
test/boost/serialization_test.cc: add test for reverse vector deserializer
serializer_impl.hh: add reverse vector serializer
serializer_impl: remove unneeded generic parameter
As noted in issue #11979, Scylla inconsistently (and unlike Cassandra)
requires "IS NOT NULL" one some but not all materialized-view key
columns. Specifically, Scylla does not require "IS NOT NULL" on the
base's partition key, while Cassandra does.
This patch is a test which demonstrates this inconsistency. It currently
passes on Cassandra and fails on Scylla, so is marked xfail.
Refs #11979
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11980
The get_live_token_owners returns the nodes that are part of the ring
and live.
The get_unreachable_token_owners returns the nodes that are part of the ring
and is not alive.
The token_metadata::get_all_endpoints returns nodes that are part of the
ring.
The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and call is_alive to check if
the node is up or down. So that the correctness does not depend on
any derived information.
This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.
Fixes#10296Fixes#11928Closes#11952
This PR is a follow-up to https://github.com/scylladb/scylladb/pull/11918.
With this PR:
- The "ScyllaDB Enterprise" label is added to all the features that are only available in ScyllaDB Enterprise.
- The previous Enterprise-only note is removed (it was included in multiple files as _/rst_include/enterprise-only-note.rst_ - this file is removed as it is no longer used anywhere in the docs).
- "Scylla Enterprise" was removed from `versionadded `because now it's clear that the feature was added for Enterprise.
Closes#11975
* github.com:scylladb/scylladb:
doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore
doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features
The helper is already widely used, one (last) test case can benefit from using it too
Closes#11978
* github.com:scylladb/scylladb:
test: Indentation fix after previous patch
test: Wse with_sstable_directory() helper
We use Barrett tables (misspelled in the code unfortunately) to fold
crc computations of multiple buffers into a single crc. This is important
because it turns out to be faster to compute crc of three different buffers
in parallel rather than compute the crc of one large buffer, since the crc
instruction has latency 3.
Currently, we have a separate code generation step to compute the
fold tables. The step generates a new C++ source files with the tables.
But modern C++ allows us to do this computation at compile time, avoiding
the code generation step. This simplifies the build.
This series does that. There is some complication in that the code uses
compiler intrinsics for the computation, and these are not constexpr friendly.
So we first introduce constexpr-friendly alternatives and use them.
To prove the transformation is correct, I compared the generated code from
before the series and from just before the last step (where we use constexpr
evaluation but still retain the generated file) and saw no difference in the values.
Note that constexpr is not strictly needed - we could have run the code in the
global variables' initializer. But that would cause a crash if we run on a pre-clmul
machine, and is not as fun.
Closes#11957
* github.com:scylladb/scylladb:
test: crc: add unit tests for constexpr clmul and barrett fold
utils: crc combine table: generate at compile time
utils: barrett: inline functions in header
utils: crc combine table: generate tables at compile time
utils: crc combine table: extract table generation into a constexpr function
utils: crc combine table: extract "pow table" code into constexpr function
utils: crc combine table: store tables std::arrray rather than C array
utils: barrett: make the barrett reduction constexpr friendly
utils: clmul: add 64-bit constexpr clmul
utils: barrett: extract barrett reduction constants
utils: barrett: reorder functions
utils: make clmul() constexpr
Introduce a templated function do_on_leader_with_retries,
use it in add_entries/modify_config/read_barrier. The
function implements the basic logic of retries with aborts
and leader changes handling, adds a delay between
iterations to protect against tight loops.
Replaced the yield on transport_error
with wait_for_next_tick. Added delays for retries, similar
to add_entry/modify_config: we postpone the next
call attempt if we haven't received new information
about the current leader.
When modify_config or add_entry is forwarded
to the leader, it may reach the node at
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in case of modify_config, other
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could led to a tight loop.
The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.
We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We neglect an excessive delay if the newly elected leader
is the same as the previous one, this supposed to be a rare.
Fixes: #11564
It's already used everywhere, but one test case wires up the
sstable_directory by hand. Fix it too, but keep in mind, that the caller
fn stops the directory early.
(indentation is deliberately left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently when we want to deserialize mutation in reverse, we unfreeze
it and consume from the end. This new reverse vector deserializer
goes through input stream remembering substreams that contain a given
output range member, and while traversing from the back, deserialize
each substream.
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.
Then, later on, on_change sees that:
```
if (get_token_metadata().is_member(endpoint)) {
co_await do_update_system_peers_table(endpoint, state, value);
```
As described in #11925.
Fixes#11925Closes#11930
* github.com:scylladb/scylladb:
storage_service, system_keyspace: add debugging around system.peers update
storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().
But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction_group will be allowed to
create its own tracker and manage it by itself.
Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
Closes#11762
* github.com:scylladb/scylladb:
replica: Allow one compaction_backlog_tracker for each compaction_group
compaction: Make compaction_state available for compaction tasks being stopped
compaction: Implement move assignment for compaction_backlog_tracker
compaction: Fix compaction_backlog_tracker move ctor
compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
compaction: kill undefined get_unimplemented_backlog_tracker()
replica: Refactor table::set_compaction_strategy for multiple groups
Fix exception safety when transferring ongoing charges to new backlog tracker
replica: move_sstables_from_staging: Use tracker from group owning the SSTable
replica: Move table::backlog_tracker_adjust_charges() to compaction_group
replica: table::discard_sstables: Use compaction_group's backlog tracker
replica: Disable backlog tracker in compaction_group::stop()
replica: database_sstable_write_monitor: use compaction_group's backlog tracker
replica: Move table::do_add_sstable() to compaction_group
test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
compaction/table_state: Introduce get_backlog_tracker()
Check that the constexpr variants indeed match the runtime variants.
I verified manually that exactly one computation in each test is
executed at run time (and is compared against a constant).
By now the crc combine tables are generated at compile time,
but still in a separate code generation step. We now eliminate
the code generation step and instead link the global variables
directly into the main executable. The global variables have
been conveniently named exactly as the code generation step
names them, so we don't need to touch any users.
Move the tables into global constinit variables that are
generated at compile time. Note the code that creates
the generated crc32_combine_table.cc is still called; it
transorms compile-time generated tables into a C++ source
that contains the same values, as literals.
If we generate a diff between gen/utils/gz/crc_combine_table.cc
before this series and after this patch, we see the only change
in the file is the type of the variable (which changed to
std::array), proving our constexpr code is correct.
Move the code to a constexpr function, so we can later generate the tables at
compile time. Note that although the function is constexpr, it is still
evaluated at runtime, since the calling function (main()) isn't constexpr
itself.
A "pow table" is used to generate the Barrett fold tables. Extract its
code into a constexpr function so we can later generate the fold tables
at compile time.
C arrays cannot be returned from functions and therefore aren't suitable
for constexpr processing. std::array<> is a regular value and so is
constexpr friendly.
This is used when generating the Barrett reduction tables, and also when
applying the Barrett reduction at runtime, so we need it to be constexpr
friendly.
clmul() is a pure function and so should already be constexpr,
but it uses intrinsics that aren't defined as constexpr and
so the compiler can't really compute it at compile time.
Fix by defining a constexpr variant and dispatching based
on whether we're being constant-evaluated or not.
The implementation is simple, but in any case proof that it
is correct will be provided later on.
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().
But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.
On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.
Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_backlog_tracker will be managed by compaction_manager, in the
per table state. As compaction tasks can access the tracker throughout
its lifetime, remove() can only deregister the state once we're done
stopping all tasks which map to that state.
remove() extracted the state upfront, then performed the stop, to
prevent new tasks from being registered and left behind. But we can
avoid the leak of new tasks by only closing the gate, which waits
for all tasks (which are stopped a step earlier) and once closed,
prevents new tasks from being registered.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Luckily it's not used anywhere. Default move ctor was picked but
it won't clear _manager of old object, meaning that its destructor
will incorrectly deregister the tracker from
compaction_backlog_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Refactoring the function for it to accomodate multiple compaction
groups.
To still provide strong exception guarantees, preparation and
execution of changes will be separated.
Once multiple groups are supported, each group will be prepared
first, and the noexcept execution will be done as a last step.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When setting a new strategy, the charges of old tracker is transferred
to the new one.
The problem is that we're not reverting changes if exception is
triggered before the new strategy is successfully set.
To fix this exception safety issue, let's copy the charges instead
of moving them. If exception is triggered, the old tracker is still
the one used and remain intact.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When moving SSTables from staging directory, we'll conditionally add
them to backlog tracker. As each group has its own tracker, a given
sstable will be added to the tracker of the group that owns it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Procedures that call this function happen to be in compaction_group,
so let's move it to group. Simplifies the change where the procedure
retrieves tracker from the group itself.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As we're moving backlog tracker to compaction group, we need to
stop the tracker there too. We're moving it a step earlier in
table::stop(), before sstables are cleared, but that's okay
because it's still done after the group was deregistered
from compaction manager, meaning no compactions are running.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All callers of do_add_sstable() live in compaction_group, so it
should be moved into compaction_group too. It also makes easier
for the function to retrieve the backlog tracker from the group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface will be helpful for allowing replica::table, unit
tests and sstables::compaction to access the compaction group's tracker
which will be managed by the compaction manager, once we complete
the decoupling work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In commit 544ef2caf3 we fixed a bug where
a reveresed clustering-key order caused problems using a secondary index
because of incorrect type comparison. That commit also included a
regression test for this fix.
However, that fix was incomplete, and improved later in commit
c8653d1321. That later fix was labeled
"better safe than sorry", and did not include a test demonstrating
any actual bug, so unsurprisingly we never backported that second
fix to any older branches.
Recently we discovered that missing the second patch does cause real
problems, and this patch includes a test which fails when the first
patch is in, but the second patch isn't (and passes when both patches
are in, and also passes on Cassandra).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11943
The mutation compactor is used on most read-paths we have, so adding a
validator to it gives us a good coverage, in particular it gives us full
coverage of queries and compaction.
The validator validates mutation token (and mutation fragment kind)
monotonicity as that is quite cheap, while it is enough to catch the
most common problems. As we already have a validator on the compaction
path (in the sstable writer), the validator is disabled when the
mutation compactor is instantiated for compaction.
We should probably make this configurable at some point. The addition
of this validator should prevent the worst of the fragment reordering
bugs to affect reads.
Which, as its name suggests, makes the validating filter not validate
anything at all. This validation level can be used effectively to make
it so as if the validator was not there at all.
The test's input data is currently out-of-order, violating a fundamental
invariant of data always being sorted. This doesn't cause any problems
right now, but soon it will. Sort it to avoid it.
Said method calls `compact_mutation_state::start_new_page()` which
requires the kind of the next fragment in the reader. When there is no
fragment (reader is at EOS), we use partition-end. This was a poor
choice: if the reader is at EOS, partition-kind was the last fragment
kind, if the stream were to continue the next fragment would be a
partition-start.
Instead of using assigned IP addresses, use a local integer ID for
managing servers. IP address can be reused by a different server.
While there, get host ID (UUID). This can also be reused with `node
replace` so it's not good enough for tracking.
Closes#11747
* github.com:scylladb/scylladb:
test.py: use internal id to manage servers
test.py: rename hostname to ip_addr
test.py: get host id
test.py: use REST api client in ScyllaCluster
test.py: remove unnecessary reference to web app
test.py: requests without aiohttp ClientSession
In the 5.1 -> 5.2 upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.
In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
situation.
Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.
Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
once they exist
It's a copy-paste from the 5.0 -> 5.1 guide with substitutions:
s/5.1/5.2,
s/5.0/5.1
The metric update guide is not written, I left a TODO.
Also I didn't include the guide in
docs/upgrade/upgrade-opensource/index.rst, since 5.2 is not released
yet.
The guide can be accessed by manually following the link:
/upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/
Instead of using assigned IP addresses, use an internal server id.
Define types to distinguish local server id, host ID (UUID), and IP
address.
This is needed to test servers changing IP address and for node replace
(host UUID).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The code explicitly manages an IP as string, make it explicit in the
variable name.
Define its type and test for set in the instance instead of using an
empty string as placeholder.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When initializing a ScyllaServer, try to get the host id instead of only
checking the REST API is up.
Use the existing aiohttp session from ScyllaCluster.
In case of HTTP error check the status was not an internal error (500+).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move the REST api client to ScyllaCluster. This will allow the cluster
to query its own servers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The aiohttp.web.Application only needs to be passed, so don't store a
reference in ScyllaCluster object.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Simplify REST helper by doing requests without a session.
Reusing an aiohttp.ClientSession causes knock-on effects on
`rest_api/test_task_manager` due to handling exceptions outside of an
async with block.
Requests for cluster management and Scylla REST API don't need session,
anyway.
Raise HTTPError with status code, text reason, params, and json.
In ScyllaCluster.install_and_start() instead of adding one more custom
exception, just catch all exceptions as they will be re-raised later.
While there avoid code duplication and improve sanity, type checking,
and lint score.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.
However, the latter broadcasts all value to all shards
so it's terribly inefficient.
Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.
Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.
Refs #7316Closes#11893
* github.com:scylladb/scylladb:
tests: check ttl on different shards
utils: config_src: add set_value_on_all_shards functions
utils: config_file: add config_source::API
This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator.
The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges.
Refs #9559Closes#11932
* github.com:scylladb/scylladb:
db: view_update_generator: always clean up staging sstables
compaction: extract incremental_owned_ranges_checker out to dht
This reverts commit ba6186a47f.
Said commit violates the widely held assumption that sstables
generations can be used as sstable identity. One known problem caused
this is potential OOO partition emitted when reading from sstables
(#11843). We now also have a better fix for #11789 (the bug this commit
was meant to fix): 4aa0b16852. So we can
revert without regressions.
Fixes: #11843Closes#11886
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash, this wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentialy caused a null dereference.
Tests: 1. The fixed issue scenario from github.
2. Unit tests in release mode.
Fixes#11774
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>
Closes#11777
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.
Then, later on, on_change sees that:
```
if (get_token_metadata().is_member(endpoint)) {
co_await do_update_system_peers_table(endpoint, state, value);
```
As described in #11925.
Fixes#11925
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.
However, the latter broadcasts all value to all shards
so it's terribly inefficient.
Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.
Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.
Refs #7316
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since they are currently not cleaned up by cleanup compaction
filter their tokens, processing only tokens owned by the
current node (based on the keyspace replication strategy).
Refs #9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is currently used by cleanup_compaction partition filter.
Factor it out so it can be used to filter staging sstables in
the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a class inherits from multiple virtual base classes, pointers to
instances of this class via one of its base classes, might point to
somewhere into the object, not at its beginning. Therefore, the simple
method employed currently by $downcast_vptr() of casting the provided
pointer to the type extracted from the vtable name fails. Instead when
this situation is detected (detectable by observing that the symbol name
of the partial vtable is not to an offset of +16, but larger),
$downcast_vptr() will iterate over the base classes, adjusting the
pointer with their offsets, hoping to find the true start of the object.
In the one instance I tested this with, this method worked well.
At the very least, the method will now yield a null pointer when it
fails, instead of a badly casted object with corrupt content (which the
developer might or might not attribute to the bad cast).
Closes#11892
This PR adds the "ScyllaDB Enterprise" label to highlight the Enterprise-only features on the following pages:
- Encryption at Rest - the label indicates that the entire page is about an Enterprise-only feature.
- Compaction - the labels indicate the sections that are Enterprise-only.
There are more occurrences across the docs that require a similar update. I'll update them in another PR if this PR is approved.
Closes#11918
* github.com:scylladb/scylladb:
doc: fix the links to resolve the warnings
doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box
doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box
Prior to off-strategy compaction, streaming / repair would place
staging files into main sstable set, and wait for view building
completion before they could be selected for regular compaction.
The reason for that is that view building relies on table providing
a mutation source without data in staging files. Had regular compaction
mixed staging data with non-staging one, table would have a hard time
providing the required mutation source.
After off-strategy compaction, staging files can be compacted
in parallel to view building. If off-strategy completes first, it
will place the output into the main sstable set. So a parallel view
building (on sstables used for off-strategy) may potentially get a
mutation source containing staging data from the off-strategy output.
That will mislead view builder as it won't be able to detect
changes to data in main directory.
To fix it, we'll do what we did before. Filter out staging files
from compaction, and trigger the operation only after we're done
with view building. We're piggybacking on off-strategy timer for
still allowing the off-strategy to only run at the end of the
node operation, to reduce the amount of compaction rounds on
the data introduced by repair / streaming.
Fixes#11882.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11919
create-relocatable-package.py collects shared libraries used by
executables for packaging. It also adds libthread-db.so to make
debugging possible. However, the name it uses has changed in glibc,
so packaging fails in Fedora 37.
Switch to the version-agnostic names, libthread-db.so. This happens
to be a symlink, so resolve it.
Closes#11917
--online-discard option defined as string parameter since it doesn't
specify "action=", but has default value in boolean (default=True).
It breaks "provisioning in a similar environment" since the code
supposed boolean value should be "action='store_true'" but it's not.
We should change the type of the option to int, and also specify
"choices=[0, 1]" just like --io-setup does.
Fixes#11700Closes#11831
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).
Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.
To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).
This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.
To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
an IP address, because now the stored `_alive_set` already contains
`raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
`gms::inet_address`. To send the ping message, we need to translate
this to IP address; we do it by the `raft_address_map` pointer
introduced in an earlier commit.
Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
Closes#11759
* github.com:scylladb/scylladb:
direct_failure_detector: get rid of complex `endpoint_id` translations
service/raft: ping `raft::server_id`s, not `gms::inet_address`es
service/raft: store `raft_address_map` reference in `direct_fd_pinger`
gms: gossiper: move `direct_fd_pinger` out to a separate service
gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
This PR introduces the following changes to the documentation landing page:
- The " New to ScyllaDB? Start here!" box is added.
- The "Connect your application to Scylla" box is removed.
- Some wording has been improved.
- "Scylla" has been replaced with "ScyllaDB".
Closes#11896
* github.com:scylladb/scylladb:
Update docs/index.rst
doc: replace Scylla with ScyllaDB on the landing page
doc: improve the wording on the landing page
doc: add the link to the ScyllaDB Basics page to the documentation landing page
There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the
same is true for other version pairs, but I digress) for different
environments:
- "ScyllaDB Image for EC2, GCP, and Azure"
- Ubuntu
- Debian
- RHEL/CentOS
THe Ubuntu and Debian pages used a common template:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
with different variable substitutions.
The "Image" page used a similar template, with some extra content in the
middle:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
The RHEL/CentOS page used a different template:
```
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst
```
This was an unmaintainable mess. Most of the content was "the same" for
each of these options. The only content that must actually be different
is the part with package installation instructions (e.g. calls to `yum`
vs `apt-get`). The rest of the content was logically the same - the
differences were mistakes, typos, and updates/fixes to the text that
were made in some of these docs but not others.
In this commit I prepare a single page that covers the upgrade and
rollback procedures for each of these options. The section dependent on
the system was implemented using Sphinx Tabs.
I also fixed and changed some parts:
- In the "Gracefully stop the node" section:
Ubuntu/Debian/Images pages had:
```rst
.. code:: sh
sudo service scylla-server stop
```
RHEL/CentOS pages had:
```rst
.. code:: sh
.. include:: /rst_include/scylla-commands-stop-index.rst
```
the stop-index file contained this:
```rst
.. tabs::
.. group-tab:: Supported OS
.. code-block:: shell
sudo systemctl stop scylla-server
.. group-tab:: Docker
.. code-block:: shell
docker exec -it some-scylla supervisorctl stop scylla
(without stopping *some-scylla* container)
```
So the RHEL/CentOS version had two tabs: one for Scylla installed
directly on the system, one for Scylla running in Docker - which is
interesting, because nothing anywhere else in the upgrade documents
mentions Docker. Furthermore, the RHEL/CentOS version used `systemctl`
while the ubuntu/debian/images version used `service` to stop/start
scylla-server. Both work on modern systems.
The Docker option is completely out of place - the rest of the upgrade
procedure does not mention Docker. So I decided it doesn't make sense to
include it. Docker documentation could be added later if we actually
decide to write upgrade documentation when using Docker... Between
`systemctl` and `service` I went with `service` as it's a bit
higher-level.
- Similar change for "Start the node" section, and corresponding
stop/start sections in the Rollback procedure.
- To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb
repo" in the Debian/Ubuntu tabs, I provide two separate links: to
Debian and Ubuntu repos.
- the link to rollback procedure in the RPM guide (in 'Download and
install the new release' section) pointed to rollback procedure from
3.0 to 3.1 guide... Fixed to point to the current page's rollback
procedure.
- in the rollback procedure steps summary, the RPM version missed the
"Restore system tables" step.
- in the rollback procedure, the repository links were pointing to the
new versions, while they should point to the old versions.
There are some other pre-existing problems I noticed that need fixing:
- EC2/GCP/Azure option has no corresponding coverage in the rollback
section (Download and install the old release) as it has in the
upgrade section. There is no guide for rolling back 3rd party and OS
packages, only Scylla. I left a TODO in a comment.
- the repository links assume certain Debian and Ubuntu versions (Debian
10 and Ubuntu 20), but there are more available options (e.g. Ubuntu
22). Not sure how to deal with this problem. Maybe a separate section
with links? Or just a generic link without choice of platform/version?
Closes#11891
Flush the memtable before cleaning up the table so not to leave any disowned tokens in the memtable
as they might be resurrected if left in the memtable.
Fixes#1239Closes#11902
* github.com:scylladb/scylladb:
table: perform_cleanup_compaction: flush memtable
table: add perform_cleanup_compaction
api: storage_service: add logging for compaction operations et al
Coroutines and asan don't mix well on aarch64. This was seen in
22f13e7ca3 (" Revert "Merge 'cql3: select_statement: coroutinize
indexed_table_select_statement::do_execute_base_query()' from Avi
Kivity"") where a routine coroutinization was reverted due to failures
on aarch64 debug mode.
In clang 15 this is even worse, the existing code starts failing.
However, if we disable optimization (-O0 rather than -Og), things
begin to work again. In fact we can reinstate the patch reverted
above even with clang 12.
Fix (or rather workaround) the problem by avoiding -Og on aarch64
debug mode. There's the lingering fear that release mode is
miscompiled too, but all the tests pass on clang 15 in release mode
so it appears related to asan.
Closes#11894
We don't explicitly cleanup the memtable, while
it might hold tokens disowned by the current node.
Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.
Note that non-owned ranges are invalidate in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.
Fixes#1239
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the integration with compaction_manager
from the api layer to the tabel class so
it can also make sure the memtable is cleaned up in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The test runs remove_node command with background ddl workload.
It was written in an attempt to reproduce scylladb#11228 but seems to have
value on its own.
The if_exists parameter has been added to the add_table
and drop_table functions, since the driver could retry
the request sent to a removed node, but that request
might have already been completed.
Function wait_for_host_known waits until the information
about the node reaches the destination node. Since we add
new nodes at each iteration in main, this can take some time.
A number of abort-related options was added
SCYLLA_CMDLINE_OPTIONS as it simplifies
nailing down problems.
Closes#11734
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).
Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.
To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).
This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.
To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
an IP address, because now the stored `_alive_set` already contains
`raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
`gms::inet_address`. To send the ping message, we need to translate
this to IP address; we do it by the `raft_address_map` pointer
introduced in an earlier commit.
Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
`gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them
is to maintain a mapping between `gms::inet_address`es and
`direct_failure_detector::pinger::endpoint_id`s, another is to cache the
last known gossiper's generation number to use it for sending gossip
echo messages. The latter is the only gossiper-specific thing in this
class.
We want to move `direct_fd_pinger` utside `gossiper`. To do that, split the
gossiper-specific thing -- the generation number management -- to a
smaller class, `echo_pinger`.
`echo_pinger` is a top-level class (not a nested one like
`direct_fd_pinger` was) so we can forward-declare it and pass references
to it without including gms/gossiper.hh header.
* seastar f32ed00954...e0dabb361f (12):
> sstring: define formatter
> file: Dont violate API layering
> Add compile_commands.json to gitignore
> Merge 'Add an allocation failure metric' from Travis Downs
> Use const test objects
> Ragel chunk parser: compilation err, unused var
> build: do not expose Valgrind in SeastarTargets.cmake
> defer: mark deferred_* with [[nodiscard]]
> Log selected reactor backend during startup
> http: mark str with [[maybe_unused]]
> Merge 'reactor: open fd without O_NONBLOCK when using io_uring backend' from Kefu Chai
> reactor: add accept and connect to io_uring backend
Closes#11895
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Closes#11791
* github.com:scylladb/scylladb:
test/raft: raft_address_map_test: add replication test
service/raft: raft_address_map: replicate non-expiring entries to other shards
service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
service/raft: turn raft_address_map into a service
We capture `key` by reference, but it is in a another continuation.
Capture it by value, and avoid the default capture specification.
Found by clang 15 + asan + aarch64.
Closes#11884
To fix CVE-2022-24675, we need to a binary compiled in <= golang 1.18.1.
Only released version which compiled <= golang 1.18.1 is node_exporter
1.4.0, so we need to update to it.
See scylladb/scylla-enterprise#2317
Closes#11400
[avi: regenerated frozen toolchain]
Closes#11879
Starting from https://github.com/scylladb/scylla-pkg/pull/3035 we
removed all old tar.gz prefix from uploading to S3 or been used by
downstream jobs.
Hence, there is no point building those tar.gz files anymore
Closes#11865
Fix https://github.com/scylladb/scylla-doc-issues/issues/864
This PR:
- updates the introduction to add information about AArch64 and rewrite the content.
- replaces "Scylla" with "ScyllaDB".
Closes#11778
* github.com:scylladb/scylladb:
Update docs/getting-started/system-requirements.rst
doc: fix the link to the OS Support page
doc: replace Scylla with ScyllaDB
doc: update the info about supported architecture and rewrite the introduction
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Currently, to_string() recursively calls itself for engaged optionals.
Eliminate it. Also, use the std_optional wrapper instead of accessing
std::optional internals directly.
Scylla fiber uses a crude method of scanning inbound and outbound
references to/from other task objects of recognized type. This method
cannot detect user instantiated promise<> objects. Add a note about this
to the printout, so users are beware of this.
We collect already seen tasks in a set to be able to detect perceived
task loops and stop when one is seen. Initialize this set with the
starting task, so if it forms a loop, we won't repeat it in the trace
before cutting the loop.
Currently there is two loops and a separate line printing the starting
task, all duplicating the formatting logic. Define a method for it and
use it in all 3 places instead.
Shard boundaries can be crossed in one direction currently: when looking
for waiters on a task, but not in the other direction (looking for
waited-on tasks). This patch fixes that.
Currently seastar threads end any attempt to follow waited-on-futures.
Seastar threads need special handling because it allocates the wake up
task on its stack. This patch adds this special handling.
scylla_ptr.analyze() switches to the thread the analyzed object lives
on, but forgets to switch back. This was very annoying as any commands
using it (which is a bunch of them) were prone to suddenly and
unexpectedly switching threads.
This patch makes sure that the original thread context is switched back
to after analyzing the pointer.
Rename to scylla tables. Less typing and more up-to-date.
By default it now only lists tables from local shard. Added flag -a
which brings back old behaviour (lists on all shards).
Added -u (only list user tables) and -k (list tables of provided
keyspace only) filtering options.
Even though previous patch makes scylla not gossip this as internal_ip,
an extra sanity check may still be useful. E.g. older versions of scylla
may still do it, or this address can be loaded from system_keyspace.
refs: #11502
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace and move the connection reset there. This is mainly to
prepare for the next patch, but also makes the code a bit shorter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiping 0.0.0.0 as preferred IP may break the peer as it will
"interpret" this address as <myself> which is not what peer expects.
However, g.p.f.s. uses --listen-address argument as the internal IP
and it's not prohibited to configure it to be 0.0.0.0
It's better not to gossip the INTERNAL_IP property at all if the listen
address is such.
fixes: #11502
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so compacting everything on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
// Unfortunately no good limit to limit input size to max_sstables for LCS major
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so the level will be entirely compacted on behalf of {}.{}",level,overlapping_sstables,schema->ks_name(),schema->cf_name());
help="Build modes to generate ninja files for. The available build modes are:\n{}".format("; ".join(["{} - {}".format(m,cfg['description'])form,cfginmodes.items()])))
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",receiver.name,field,field_spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}",*receiver.name,field,field_spec->type->as_cql3_type()));
}
}
}
@@ -123,7 +123,7 @@ usertype_constructor_prepare_expression(const usertype_constructor& u, data_dict
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",receiver.name,i,spec->type->as_cql3_type()));
throwexceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}",*receiver.name,i,spec->type->as_cql3_type()));
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.