Compare commits

...

2723 Commits

Author SHA1 Message Date
Anna Stuchlik
2ccd51844c doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:28:20 +02:00
Lakshmi Narayanan Sreethar
705ec24977 db/config.cc: increment components_memory_reclaim_threshold config default
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes #18607

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)

Closes #19011
2024-06-04 07:13:28 +03:00
Botond Dénes
e89eb41e70 Merge '[Backport 5.2] : Reload reclaimed bloom filters when memory is available ' from Lakshmi Narayanan Sreethar
PR https://github.com/scylladb/scylladb/pull/17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).

The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.

Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.2.

Closes #18666

* github.com:scylladb/scylladb:
  sstable_datafile_test: add testcase to test reclaim during reload
  sstable_datafile_test: add test to verify auto reload of reclaimed components
  sstables_manager: reload previously reclaimed components when memory is available
  sstables_manager: start a fiber to reload components
  sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
  sstable_datafile_test: add test to verify reclaimed components reload
  sstables: support reloading reclaimed components
  sstables_manager: add new intrusive set to track the reclaimed sstables
  sstable: add link and comparator class to support new instrusive set
  sstable: renamed intrusive list link type
  sstable: track memory reclaimed from components per sstable
  sstable: rename local variable in sstable::total_reclaimable_memory_size
2024-05-30 11:11:39 +03:00
Kefu Chai
45814c7f14 docs: fix typos in upgrade document
s/Montioring/Monitoring/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit f1f3f009e7)

Closes #18910
2024-05-30 11:10:49 +03:00
Botond Dénes
331e0c4ca7 Merge '[Backport 5.2] mutation_fragment_stream_validating_filter: respect validating_level::none' from ScyllaDB
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.

Fixes: #18662

(cherry picked from commit f6511ca1b0)

(cherry picked from commit e7b07692b6)

(cherry picked from commit 78afb3644c)

 Refs #18667

Closes #18723

* github.com:scylladb/scylladb:
  test/boost/mutation_fragment_test.cc: add test for validator validation levels
  mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
  mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
2024-05-27 08:52:06 +03:00
Alexey Novikov
32be38dae5 make timestamp string format cassandra compatible
when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z'
it concerns any conversion except JSON timestamp format
JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and timezone specification

Fixes #14518
Fixes #7997

Closes #14726
Fixes #16575

(cherry picked from commit ff721ec3e3)

Closes #18852
2024-05-26 16:30:06 +03:00
Botond Dénes
3dacf6a4b1 test/boost/mutation_fragment_test.cc: add test for validator validation levels
To make sure that the validator doesn't validate what the validation
level doesn't include.

(cherry picked from commit 78afb3644c)
2024-05-24 03:36:28 -04:00
Botond Dénes
3d360c7caf mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.

(cherry picked from commit e7b07692b6)
2024-05-24 03:34:05 -04:00
Botond Dénes
f7a3091734 mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.

Backporting notes:
* Added const const mutation_fragment_stream_validating_filter&
  param to on_validation_error()
* Made full_name() public

(cherry picked from commit f6511ca1b0)
2024-05-24 03:33:10 -04:00
Botond Dénes
6f0d32a42f Merge '[Backport 5.2] utils: chunked_vector: fill ctor: make exception safe' from ScyllaDB
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

- [x] Although the fixes is for a rare bug, it has very low risk and so it's worth backporting to all live versions

(cherry picked from commit 64c51cf32c)

(cherry picked from commit 88b3173d03)

(cherry picked from commit 4bbb66f805)

 Refs #18636

Closes #18680

* github.com:scylladb/scylladb:
  chunked_vector_test: add more exception safety tests
  chunked_vector_test: exception_safe_class: count also moved objects
  utils: chunked_vector: fill ctor: make exception safe
2024-05-21 16:30:23 +03:00
Benny Halevy
d947f1e275 chunked_vector_test: add more exception safety tests
For insertion, with and without reservation,
and for fill and copy constructors.

Reproduces https://github.com/scylladb/scylladb/issues/18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:33:42 +03:00
Benny Halevy
d727382cc1 chunked_vector_test: exception_safe_class: count also moved objects
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.

However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:33:42 +03:00
Benny Halevy
15a090f711 utils: chunked_vector: fill ctor: make exception safe
Currently, if the fill ctor throws an exception,
the destructor won't be called, as it object is not
fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:33:42 +03:00
Yaron Kaikov
2fc86cc241 release: prepare for 5.2.19 2024-05-19 16:25:28 +03:00
Lakshmi Narayanan Sreethar
d4c523e9ef sstable_datafile_test: add testcase to test reclaim during reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4d22c4b68b)
2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar
b505ce4897 sstable_datafile_test: add test to verify auto reload of reclaimed components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a080daaa94)
2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar
80861e3bce sstables_manager: reload previously reclaimed components when memory is available
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0b061194a7)
2024-05-14 19:20:02 +05:30
Lakshmi Narayanan Sreethar
9004c9ee38 sstables_manager: start a fiber to reload components
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f758d7b114)
2024-05-14 19:19:22 +05:30
Lakshmi Narayanan Sreethar
72494af137 sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.

The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
2024-05-14 19:19:22 +05:30
Lakshmi Narayanan Sreethar
c15e72695d sstable_datafile_test: add test to verify reclaimed components reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 69b2a127b0)
2024-05-14 19:19:18 +05:30
Lakshmi Narayanan Sreethar
83dd78fb9d sstables: support reloading reclaimed components
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 54bb03cff8)
2024-05-14 19:17:03 +05:30
Lakshmi Narayanan Sreethar
62338d3ad0 compaction: improve partition estimates for garbage collected sstables
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.

Fixes #18283

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18465

(cherry picked from commit d39adf6438)

Closes #18659
2024-05-14 15:42:12 +03:00
Lakshmi Narayanan Sreethar
1bd6584478 sstables_manager: add new intrusive set to track the reclaimed sstables
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2340ab63c6)
2024-05-14 01:46:36 +05:30
Lakshmi Narayanan Sreethar
19f3e42583 sstable: add link and comparator class to support new instrusive set
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 140d8871e1)
2024-05-14 01:46:17 +05:30
Lakshmi Narayanan Sreethar
bb9ceae2c3 sstable: renamed intrusive list link type
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3ef2f79d14)
2024-05-14 01:45:27 +05:30
Lakshmi Narayanan Sreethar
fa154a8d00 sstable: track memory reclaimed from components per sstable
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 02d272fdb3)
2024-05-14 01:45:20 +05:30
Lakshmi Narayanan Sreethar
a9101f14f6 sstable: rename local variable in sstable::total_reclaimable_memory_size
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a53af1f878)
2024-05-14 01:45:13 +05:30
Kamil Braun
b68c06cc3a direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.

With the change, the phenomenon no longer appears.

(cherry picked from commit 8df6d10e88)

Closes #18558
2024-05-08 15:46:59 +02:00
Pavel Emelyanov
1cb959fc84 Update seastar submodule (iotune iodepth underflow fix)
* seastar b9fd21d8...5ab9a7cf (1):
  > iotune: ignore shards with id above max_iodepth

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-06 19:25:41 +03:00
Pavel Emelyanov
5f5acc813a view-builder: Print correct exception in built ste exception handler
Inside .handle_exception() continuation std::current_exception() doesn't
work, there's std::exception ex argument to handler's lambda instead

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349

(cherry picked from commit 4ac30e5337)
2024-05-01 10:20:26 +03:00
Anna Stuchlik
f71a687baf doc: run repair after changing RF of system_auth
This commit adds the requirement to run repair after changing
the replication factor of the system_auth keyspace
in the procedure of adding a new node to a cluster.

Refs: https://github.com/scylladb/scylla-enterprise/issues/4129

Closes scylladb/scylladb#18466

(cherry picked from commit d85d37921a)
2024-04-30 19:18:15 +03:00
Asias He
b2858e4028 streaming: Fix use after move in fire_stream_event
The event is used in a loop.

Found by clang-tidy:

```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
        listener->handle_stream_event(std::move(event));
                                                ^
streaming/stream_result_future.cc:80:39: note: move occurred here
        listener->handle_stream_event(std::move(event));
                                      ^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
        listener->handle_stream_event(std::move(event));
                                                ^
```

Fixes #18332

(cherry picked from commit 4fd4e6acf3)

Closes #18430
2024-04-30 15:07:43 +02:00
Lakshmi Narayanan Sreethar
0fc0474ccc sstables: reclaim_memory_from_components: do not update _recognised_components
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.

Fixes #18398

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18400

(cherry picked from commit 6af2659b57)

Closes #18437
2024-04-29 10:02:52 +03:00
Kefu Chai
119dbb0d43 thrift: avoid use-after-move in make_non_overlapping_ranges()
in handler.cc, `make_non_overlapping_ranges()` references a moved
instance of `ColumnSlice` when something unexpected happens to
format the error message in an exception, the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance in the move constructor. this
could lead to undefined behavior when dereferencing the move instance.

in this change, in order to avoid use-after free, let's keep
a copy of the referenced member variables and reference them when
formatting error message in the exception.

this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.

Refs 822a315dfa
Fixes #18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)

Closes #18373
2024-04-25 11:36:00 +03:00
Anna Mikhlin
dae9bef75f release: prepare for 5.2.18 2024-04-19 13:30:48 +03:00
Asias He
065f7178ab repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.

Fixes #18140

Closes scylladb/scylladb#18141

(cherry picked from commit 642f9a1966)
2024-04-18 16:37:05 +03:00
Botond Dénes
f17e480237 Merge '[Backport 5.2] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.

The feature can be manually verified by doing the following :

1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB

The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.

Fixes https://github.com/scylladb/scylladb/issues/17747

Backported from #17771 to 5.2.

Closes #18247

* github.com:scylladb/scylladb:
  test_bloom_filter.py: disable reclaiming memory from components
  sstable_datafile_test: add tests to verify auto reclamation of components
  test/lib: allow overriding available memory via test_env_config
  sstables_manager: support reclaiming memory from components
  sstables_manager: store available memory size
  sstables_manager: add variable to track component memory usage
  db/config: add a new variable to limit memory used by table components
  sstable_datafile_test: add testcase to verify reclamation from sstables
  sstables: support reclaiming memory from components
2024-04-17 14:34:19 +03:00
Lakshmi Narayanan Sreethar
dd9ab15bb5 test_bloom_filter.py: disable reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
2024-04-16 15:50:22 +05:30
Lakshmi Narayanan Sreethar
96db5ae5e3 sstable_datafile_test: add tests to verify auto reclamation of components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d261f0fbea)
2024-04-16 15:49:58 +05:30
Lakshmi Narayanan Sreethar
beea229deb test/lib: allow overriding available memory via test_env_config
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 169629dd40)
2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar
89367c4310 sstables_manager: support reclaiming memory from components
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.

Fixes #17747

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar
32de41ecb4 sstables_manager: store available memory size
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar
0841c0084c sstables_manager: add variable to track component memory usage
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar
786c08aa59 db/config: add a new variable to limit memory used by table components
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar
31251b37dd sstable_datafile_test: add testcase to verify reclamation from sstables
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e0b6186d16)
2024-04-16 15:30:30 +05:30
Lakshmi Narayanan Sreethar
1b390ceb24 sstables: support reclaiming memory from components
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
2024-04-16 13:03:45 +05:30
Tzach Livyatan
0bfe016beb Update Driver root page
The right term is Amazon DynamoDB not AWS DynamoDB
See https://aws.amazon.com/dynamodb/

Closes scylladb/scylladb#18214

(cherry picked from commit 289793d964)
2024-04-16 09:55:41 +03:00
Botond Dénes
280956f507 Merge '[Backport 5.2] repair: fix memory counting in repair' from Aleksandra Martyniuk
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.

Modify what's counted to repair memory limit.

Fixes: https://github.com/scylladb/scylladb/issues/16710.

(cherry picked from commit a4dc6553ab)

(cherry picked from commit 51c09a84cc)

Refs https://github.com/scylladb/scylladb/pull/17785

Closes #18237

* github.com:scylladb/scylladb:
  test: add test for repair_row::size()
  repair: fix memory accounting in repair_row
2024-04-16 07:07:15 +03:00
Aleksandra Martyniuk
97671eb935 test: add test for repair_row::size()
Add test which checs whether repair_row::size() considers external
memory.

(cherry picked from commit 51c09a84cc)
2024-04-09 13:29:33 +02:00
Aleksandra Martyniuk
8144134545 repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.

Include other repair_row's members' memory size in repair memory
limit.

(cherry picked from commit a4dc6553ab)
2024-04-06 22:44:51 +00:00
Ferenc Szili
2bb5fe7311 logging: Don't log PK/CK in large partition/row/cell warning
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.

This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.

The logged data will look like this:

Large cells:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db

Large rows:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db

Large partitions:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db

Fixes #18041

Closes scylladb/scylladb#18166

(cherry picked from commit f1cc6252fd)
2024-04-05 16:03:08 +03:00
Kefu Chai
4595f51d5c utils/logalloc: do not allocate memory in reclaim_timer::report()
before this change, `reclaim_timer::report()` calls

```c++
fmt::format(", at {}", current_backtrace())
```

which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.

in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.

Fixes #18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18100

(cherry picked from commit fcf7ca5675)
2024-04-02 16:38:17 +03:00
Wojciech Mitros
c0c34d2af0 mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

(cherry picked from commit 9789a3dc7c)

Closes #18104
2024-04-02 10:09:01 +02:00
Pavel Emelyanov
c34a503ef3 Update seastar submodule (iotune error path crash fix)
* seastar eb093f8a...b9fd21d8 (1):
  > iotune: Don't close file that wasn't opened

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-28 10:53:51 +03:00
Beni Peled
447a3beb47 release: prepare for 5.2.17 2024-03-27 14:35:37 +02:00
Botond Dénes
aca50c46b7 tools/toolchain: update python driver
Backports scylladb/scylladb#17604 and scylladb/scylladb#17956.

Fixes scylladb/scylladb#16709
Fixes scylladb/scylladb#17353

Closes #17661
2024-03-27 08:48:25 +02:00
Wojciech Mitros
44bcaca929 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718)

Closes #17999
2024-03-26 09:38:17 +02:00
Botond Dénes
2e2bf79092 Merge '[Backport 5.2] tests: utils: error injection: print time duration instead of count' from ScyllaDB
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.

Fixes #15902

(cherry picked from commit 8a5689e7a7)

(cherry picked from commit 1d33a68dd7)

Closes #17911

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-03-25 17:41:23 +02:00
Pavel Emelyanov
616199f79c Update seastar submodule (dupliex IO queue activation fix)
* seastar ad0f2d5d...eb093f8a (1):
  > fair_queue: Do not pop unplugged class immediately

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 12:58:24 +03:00
Wojciech Mitros
5dfb6c9ead mv: adjust the overhead estimation for view updates
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.

After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes

This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.

The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.

The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.

Fixes #17364

(cherry picked from commit 5ab3586135)

Closes #17858
2024-03-20 13:52:23 +02:00
Kefu Chai
6209f5d6d4 tests: utils: error injection: print time duration instead of count
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)

test/boost/error_injection_test.cc(167): fatal error:
    in "test_inject_future_disabled":
      critical check wait_time > sleep_msec has failed [23839ns <= 10ms]

Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
2024-03-20 09:40:16 +00:00
Kefu Chai
ac288684c6 error_injection: do not cast to milliseconds when injecting timeout
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.

Fixes #15902

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
2024-03-20 09:40:16 +00:00
Raphael S. Carvalho
fc1d126f31 replica: Fix major compaction semantics by performing off-strategy first
Major compaction semantics is that all data of a table will be compacted
together, so user can expect e.g. a recently introduced tombstone to be
compacted with the data it shadows.
Today, it can happen that all data in maintenance set won't be included
for major, until they're promoted into main set by off-strategy.
So user might be left wondering why major is not having the expected
effect.
To fix this, let's perform off-strategy first, so data in maintenance
set will be made available by major. A similar approach is done for
data in memtable, so flush is performed before major starts.
The only exception will be data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization in comapaction manager of reshape jobs guarantee
correctness if there's an ongoing off-strategy on behalf of the
table.

Fixes #11915.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15792

(cherry picked from commit ea6c281b9f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #17901
2024-03-20 08:48:17 +02:00
Anna Stuchlik
70edebd8d7 doc: fix the image upgrade page
This commit updates the Upgrade ScyllaDB Image page.

- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.

(cherry picked from commit fc90112b97)

Closes #17885
2024-03-19 16:47:06 +02:00
Petr Gusev
dffc0fb720 repair_meta: get_estimated_partitions fix
The shard_range parameter was unused.

Fixes: #17863

(cherry picked from commit b9f527bfa8)
2024-03-18 14:27:45 +02:00
Kamil Braun
0bb338c521 test: remove test_writes_to_recent_previous_cdc_generations
The test in its original form relies on the
`error_injections_at_startup` feature, which 5.2 doesn't have, so I
adapted the test to enable error injections after bootstrapping nodes in
the backport (9c44bbce67). That is however
incorrect, it's important for the injection to be enabled while the
nodes are booting, otherwise the test will be flaky, as we observed.
Details in scylladb/scylladb#17749.

Remove the test from 5.2 branch.

Fixes scylladb/scylladb#17749

Closes #17750
2024-03-15 10:22:22 +02:00
Tomasz Grabiec
cefa19eb93 Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.2.

We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.

Closes #17640

* github.com:scylladb/scylladb:
  migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
  raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
  migration_manager: fix indentation after the previous patch.
  messaging_service: process migration_request rpc on shard 0
  migration_manager: take group0 lock during raft snapshot taking
2024-03-14 23:41:02 +01:00
Nadav Har'El
08077ff3e8 alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172

(cherry picked from commit 21e7deafeb)
2024-03-13 15:08:32 +02:00
Kamil Braun
405387c663 test: unflake test_topology_remove_garbage_group0
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
   peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```

Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.

Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.

Fixes scylladb/scylladb#16373

(cherry picked from commit a68701ed4f)

Closes #17703
2024-03-12 13:43:21 +01:00
Kamil Braun
b567364af1 migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer
Jumping to shard 0 during group 0 snapshot transfer is required because
we take group 0 lock, onyl available on shard 0. But outside of Raft
mode it only pessimizes performance unnecessarily, so don't do it.
2024-03-12 11:19:31 +01:00
Botond Dénes
3897d44893 repair: resolve start-up deadlock
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:

node1                              node2
repair1_master (takes permit)      repair1_follower (waits on permit)
repair2_master (waits for permit)  repair2_follower (takes permit)

In lieu of strong central coordination, we solved this by making permits
evictable: if repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair constrol
structrure -- repair meta -- is create. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.

Refs: #17644
Fixes: #17591

Closes #17646

(cherry picked from commit c6e108a)

Backport notes:

The fix above does not apply to 5.2, because on 5.2 the reader is
created immediately when the repair-meta is created. So we don't need
the game with a fake inactive read, we can just pause the already
created reader in the repair-reader constructor.

Closes #17730
2024-03-12 08:24:26 +02:00
Michał Chojnowski
54048e5613 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.

Fixes #17625

(cherry picked from commit 7a7b8972e5)

Closes #17725
2024-03-11 16:17:32 +02:00
Lakshmi Narayanan Sreethar
e8736ae431 reader_permit: store schema_ptr instead of raw schema pointer
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.

Fixes #16180

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#16658

(cherry picked from commit 76f0d5e35b)

Closes #17694
2024-03-08 10:56:12 +02:00
Gleb Natapov
42d25f1911 raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations a valid on shard 0 only. Assert that.

(cherry picked from commit 9847e272f9)
2024-03-05 16:51:13 +01:00
Gleb Natapov
619f75d1de migration_manager: fix indentation after the previous patch.
(cherry picked from commit 77907b97f1)
2024-03-05 16:50:24 +01:00
Gleb Natapov
6dd31dcade messaging_service: process migration_request rpc on shard 0
Commit 0c376043eb added access to group0
semaphore which can be done on shard0 only. Unlike all other group0 rpcs
(that already always forwarded to shard0) migration_request does not
since it is an rpc that what reused from non raft days. The patch adds
the missing jump to shard0 before executing the rpc.

(cherry picked from commit 4a3c79625f)
2024-03-05 16:49:23 +01:00
Gleb Natapov
dd65bf151b migration_manager: take group0 lock during raft snapshot taking
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.

Fixes scylladb/scylladb#16821

(cherry picked from commit 0c376043eb)

Backport note: introduced missing
`raft_group0_client::hold_read_apply_mutex`
2024-03-05 16:40:00 +01:00
Yaron Kaikov
f4a7804596 release: prepare for 5.2.16 2024-03-03 14:33:44 +02:00
Botond Dénes
ba373f83e4 Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.

To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.

Use the method in streaming and repair to continue them when
the table is dropped in the meantime.

Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.

Closes #17528

* github.com:scylladb/scylladb:
  repair: handle no_such_column_family from remote node gracefully
  test: test drop table on receiver side during streaming
  streaming: fix indentation
  streaming: handle no_such_column_family from remote node gracefully
  repair: add methods to skip dropped table
2024-02-28 16:33:01 +02:00
Kamil Braun
d82c757323 Merge 'misc_services: fix data race from bad usage of get_next_version' from Piotr Dulikowski
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.

The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.

The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.

There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:

1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.

Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:

1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.

The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.

Refs: #17493

Closes scylladb/scylladb#17499

* github.com:scylladb/scylladb:
  version_generator: check that get_next_version is called on shard 0
  misc_services: fix data race from bad usage of get_next_version

(cherry picked from commit fd32e2ee10)
2024-02-28 14:28:03 +01:00
Aleksandra Martyniuk
78aeb990a6 repair: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.

(cherry picked from commit cf36015591)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
23493bb342 test: test drop table on receiver side during streaming
(cherry picked from commit 2ea5d9b623)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
d19afd7059 streaming: fix indentation
(cherry picked from commit b08f539427)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
4e200aa250 streaming: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.

(cherry picked from commit 219e1eda09)
2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk
afca1142cd repair: add methods to skip dropped table
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.

Add method which waits until all raft changes are applied and then
checks whether given table exists.

Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.

(cherry picked from commit 5202bb9d3c)
2024-02-28 11:45:54 +01:00
Botond Dénes
ce1a422c9c Merge '[Backport 5.2] sstables: close index_reader in has_partition_key' from Aleksandra Martyniuk
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

Fixes: https://github.com/scylladb/scylladb/issues/17232.

Closes #17532

* github.com:scylladb/scylladb:
  test: add test to check if reader is closed
  sstables: close index_reader in has_partition_key
2024-02-27 16:12:17 +02:00
Aleksandra Martyniuk
296be93714 test: add test to check if reader is closed
Add test to check if reader is closed in sstable::has_partition_key.

(cherry picked from commit 4530be9e5b)
2024-02-26 16:17:12 +01:00
Aleksandra Martyniuk
6feb802d54 sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

(cherry picked from commit 5227336a32)
2024-02-26 16:17:12 +01:00
Avi Kivity
9c44bbce67 Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.

This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.

The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.

Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.

Apart from this change, this PR adds tests for it and updates
the documentation.

This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.

Fixes scylladb/scylladb#7251
Fixes scylladb/scylladb#15260

Closes scylladb/scylladb#17134

* github.com:scylladb/scylladb:
  docs: using-scylla: cdc: remove info about failing writes to old generations
  docs: dev: cdc: document writing to previous CDC generations
  test: add test_writes_to_previous_cdc_generations
  cdc: generation: allow increasing generation_leeway through error injection
  cdc: metadata: allow sending writes to the previous generations

(cherry picked from commit 9bb4482ad0)

Backport note: replaced `servers_add` with `server_add` loop in tests
replaced `error_injections_at_startup` (not implemented in 5.2) with
`enable_injection` post-boot
2024-02-22 15:05:19 +01:00
Nadav Har'El
6a6115cd86 mv: fix missing view deletions in some cases of range tombstones
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.

This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).

Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.

This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:

1. The counter "_op_count" that decides where to break the output mutation
   should only be incremented when adding rows to this output mutation.
   The existing code strangely incrmented it on every read (!?) which
   resulted in the counter being incremented on every *input* fragment,
   and in particular could reach the limit 100 between two range
   tombstone pieces.

2. Moreover, the length of output was checked in the wrong place...
   The existing code could get to 100 rows, not check at that point,
   read the next input - half a range tombstone - and only *then*
   check that we reached 100 rows and stop. The fix is to calculate
   the number of rows in the right place - exactly when it's needed,
   not before the step.

The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.

This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).

By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).

Fixes #17117

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17164

(cherry picked from commit 14315fcbc3)
2024-02-22 15:36:58 +02:00
Avi Kivity
e0e46fbc50 Regenerate frozen toolchain
For gnutls 3.8.3.

Since Fedora 37 is end-of-life, pick the package from Fedora 38. libunistring
needs to be updated to satisfy the dependency solver.

Fixes #17285.

Closes scylladb/scylladb#17287

Signed-off-by: Avi Kivity <avi@scylladb.com>

Closes #17411
2024-02-20 12:34:46 +02:00
Wojciech Mitros
27ab3b1744 rust: update dependencies
The currently used version of "rustix" depency had a minor
security vulnerability. This patch updates the corresponding
crate.
The update was performed using "cargo update" on "rustix"
package and version "0.36.17"
relevant package and the corresponding version.

Refs #15772

Closes #17408
2024-02-19 22:12:50 +02:00
Michał Jadwiszczak
0d22471222 schema::describe: print 'synchronous_updates' only if it was specified
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.

Fixes: #14924

Closes #14928

(cherry picked from commit b92d47362f)
2024-02-19 09:10:34 +02:00
Botond Dénes
422a731e85 query: do not kill unpaged queries when they reach the tombstone-limit
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.

So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.

Fixes: #17241

Closes scylladb/scylladb#17242

(cherry picked from commit f068d1a6fa)
2024-02-15 12:50:30 +02:00
Yaron Kaikov
1fa8327504 release: prepare for 5.2.15 2024-02-11 14:17:31 +02:00
Pavel Emelyanov
f3c215aaa1 Update seastar submodule
* seastar 29badd99...ad0f2d5d (1):
  > Merge "Slowdown IO scheduler based on dispatched/completed ratio" into branch-5.2

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 12:22:58 +03:00
Botond Dénes
94af1df2cf Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El
This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.

The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.

Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.

Fixes #17035

Closes scylladb/scylladb#17073

* github.com:scylladb/scylladb:
  utils: replace assert() by on_internal_error()
  utils: add on_internal_error with common logger
  utils: add a timeuuid minimum, like we had maximum
  test/cql-pytest: tests for "date" type

(cherry picked from commit 2a4b991772)
2024-02-07 14:19:32 +02:00
Botond Dénes
9291eafd4a Merge '[Backport 5.2] Raft snapshot fixes' from Kamil Braun
Backports required to fix scylladb/scylladb#16683 in 5.2:
- when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot.

Closes #17087

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
  raft: server: add workaround for scylladb/scylladb#12972
  raft: Store snapshot update and truncate log atomically
  service: raft: force initial snapshot transfer in new cluster
  raft_sys_table_storage: give initial snapshot a non zero value
2024-02-07 11:55:20 +02:00
Michał Chojnowski
4546d0789f row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes #16759

Closes scylladb/scylladb#17138

(cherry picked from commit ed98102c45)
2024-02-04 14:46:57 +02:00
Kamil Braun
4e257c5c74 test_raft_snapshot_request: fix flakiness (again)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.

To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.

Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.

Closes scylladb/scylladb#17106

(cherry picked from commit c911bf1a33)
2024-02-02 11:31:19 +01:00
Kamil Braun
08021dc906 test_raft_snapshot_request: fix flakiness
Add workaround for scylladb/python-driver#295.

Also an assert made at the end of the test was false, it is fixed with
appropriate comment added.

(cherry picked from commit 74bf60a8ca)
2024-02-02 11:31:19 +01:00
Botond Dénes
db586145aa Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.

Also add a test.

Fixes scylladb/scylladb#16683

Closes scylladb/scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f248)

Backport note: test_raft_fix_broken_snapshot had to be removed because
the "error injections enabled at startup" feature does not yet exist in
5.2.
2024-02-01 15:39:14 +01:00
Botond Dénes
ce0ed29ad6 Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb/scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32f81)

Backport notes:
- `has_output()` has a smaller condition in the backported version
  (because the condition was smaller in `poll_output()`)
- `process_fsm_output` has a smaller body (because `io_fiber` had a
  smaller body) in the backported version
- the HTTP API is only started if `raft_group_registry` is started
2024-02-01 15:38:51 +01:00
Kamil Braun
cbe8e05ef6 raft: server: add workaround for scylladb/scylladb#12972
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec77c6 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from interrupted snapshot transfer by
retrying it eventually, and everything works. However an ERROR is
reported due to that one failed snapshot transfer, and dtests dont like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
2024-02-01 14:29:56 +01:00
Michael Huang
84004ab83c raft: Store snapshot update and truncate log atomically
In case the snapshot update fails, we don't truncate commit log.

Fixes scylladb/scylladb#9603

Closes scylladb/scylladb#15540

(cherry picked from commit 1640f83fdc)
2024-02-01 13:10:05 +01:00
Kamil Braun
753e2d3c57 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336

(cherry picked from commit ff386e7a44)

Backport note: contrary to the claims above, it turns out that it is
actually necessary to create snapshots in clusters which bootstrap with
Raft, because of tombstones in current schema state expire hence
applying schema mutations from old Raft log entries is not really
idempotent. Snapshot transfer, which transfers group 0 history and
state_ids, prevents old entries from applying schema mutations over
latest schema state.

Ref: scylladb/scylladb#16683
2024-01-31 17:00:10 +01:00
Gleb Natapov
42cf25bcbb raft_sys_table_storage: give initial snapshot a non zero value
We create a snapshot (config only, but still), but do not assign it any
id. Because of that it is not loaded on start. We do want it to be
loaded though since the state of group0 will not be re-created from the
log on restart because the entries will have outdated id and will be
skipped. As a result in memory state machine state will not be restored.
This is not a problem now since schema state it restored outside of raft
code.

Message-Id: <20230316112801.1004602-5-gleb@scylladb.com>
(cherry picked from commit a690070722)
2024-01-31 16:50:42 +01:00
Aleksandra Martyniuk
f85375ff99 api: ignore future in task_manager_json::wait_task
Before returning task status, wait_task waits for it to finish with
done() method and calls get() on a resulting future.

If requested task fails, an exception will be thrown and user will
get internal server error instead of failed task status.

Result of done() method is ignored.

Fixes: #14914.
(cherry picked from commit ae67f5d47e)

Closes #16438
2024-01-30 10:54:33 +02:00
Aleksandra Martyniuk
35a0a459db compaction: ignore future explicitly
discard_result ignores only successful futures. Thus, if
perform_compaction<regular_compaction_task_executor> call fails,
a failure is considered abandoned, causing tests to fail.

Explicitly ignore failed future.

Fixes: #14971.

Closes #15000

(cherry picked from commit 7a28cc60ec)

Closes #16441
2024-01-30 10:53:09 +02:00
Kamil Braun
784695e3ac system_keyspace: use system memory for system.raft table
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb/scylladb#15622

Closes scylladb/scylladb#15972

(cherry picked from commit f094e23d84)
2024-01-25 17:59:49 +01:00
Avi Kivity
351d6d6531 Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani
When a base table changes and altered, so does the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns).
However the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.

Fixes https://github.com/scylladb/scylladb/issues/16392

This series also adds a test which fails without this fix and passes when the fix is applied.

Closes scylladb/scylladb#16897

* github.com:scylladb/scylladb:
  Add test for mv prepared statements invalidation on base alter
  query processor: treat view changes at least as table changes

(cherry picked from commit 5810396ba1)
2024-01-23 21:31:47 +02:00
Takuya ASADA
5a05ccc2f8 scylla_raid_setup: faillback to other paths when UUID not avialable
On some environment such as VMware instance, /dev/disk/by-uuid/<UUID> is
not available, scylla_raid_setup will fail while mounting volume.

To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fallback to other paths like by-partuuid,
by-id, by-path or just using real device path like /dev/md0.

To get device path, and also to dumping device status when UUID is not
available, this will introduce UdevInfo class which communicate udev
using pyudev.

Related #11359

Closes scylladb/scylladb#13803

(cherry picked from commit 58d94a54a3)

[syuu: renegerate tools/toolchain/image for new python3-pyudev package]

Closes #16938
2024-01-23 16:05:28 +02:00
Botond Dénes
a1603bcb40 readers/multishard: evictable_reader::fast_forward_to(): close reader on exception
When the reader is currently paused, it is resumed, fast-forwarded, then
paused again. The fast forwarding part can throw and this will lead to
destroying the reader without it being closed first.
Add a try-catch surrounding this part in the code. Also mark
`maybe_pause()` and `do_pause()` as noexcept, to make it clear why
that part doesn't need to be in the try-catch.

Fixes: #16606

Closes scylladb/scylladb#16630

(cherry picked from commit 204d3284fa)
2024-01-16 16:57:28 +02:00
Michał Jadwiszczak
29da20b9e0 schema: add scylla specific options to schema description
Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates`
options to schema description.

Fixes: #12389
Fixes: scylladb/scylla-enterprise#2979

Closes #16786
2024-01-16 09:56:08 +02:00
Botond Dénes
7c4ec8cf4b Update tools/java submodule
* tools/java 843096943e...a1eed2f381 (1):
  > Update JNA dependency to 5.14.0

Fixes: https://github.com/scylladb/scylla-tools-java/issues/371
2024-01-15 15:51:32 +02:00
Aleksandra Martyniuk
5def443cf0 tasks: keep task's children in list
If std::vector is resized its iterators and references may
get invalidated. While task_manager::task::impl::_children's
iterators are avoided throughout the code, references to its
elements are being used.

Since children vector does not need random access to its
elements, change its type to std::list<foreign_task_ptr>, which
iterators and references aren't invalidated on element insertion.

Fixes: #16380.

Closes scylladb/scylladb#16381

(cherry picked from commit 9b9ea1193c)

Closes #16777
2024-01-15 15:38:00 +02:00
Anna Mikhlin
c0604a31fa release: prepare for 5.2.14 2024-01-14 16:34:38 +02:00
Pavel Emelyanov
96bb602c62 Update seastar submodule (token bucket duration underflow)
* seastar 43a1ce58...29badd99 (1):
  > shared_token_bucket: Fix duration_for() underflow

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-12 15:15:56 +03:00
Botond Dénes
d96440e8b6 Merge '[Backport 5.2] Validate compaction strategy options in prepare' from Aleksandra Martyniuk
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.

Check table properties when create and alter statements are prepared.

Fixes: https://github.com/scylladb/scylladb/issues/14710.

Closes #16750

* github.com:scylladb/scylladb:
  cql3: statements: delete execute override
  cql3: statements: call check_restricted_table_properties in prepare
  cql3: statements: pass data_dictionary::database to check_restricted_table_properties
2024-01-12 10:56:54 +02:00
Aleksandra Martyniuk
ea41a811d6 cql3: statements: delete execute override
Delete overriden create_table_statement::execute as it only calls its
direct parent's (schema_altering_statement) execute method anyway.

(cherry picked from commit 6c7eb7096e)
2024-01-11 16:43:17 +01:00
Aleksandra Martyniuk
8b77fbc904 cql3: statements: call check_restricted_table_properties in prepare
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.

Check table properties when create and alter statements are prepared.

The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.

(cherry picked from commit 60fdc44bce)
2024-01-11 16:10:26 +01:00
Aleksandra Martyniuk
3ab3a2cc1b cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Pass data_dictionary::database to check_restricted_table_properties
as an arguemnt instead of query_processor as the method will be called
from a context which does not have access to query processor.

(cherry picked from commit ec98b182c8)
2024-01-11 16:10:26 +01:00
Botond Dénes
7e9107cc97 Update tools/java submodule
* tools/java 79fa02d8a3...843096943e (1):
  > build.xml: update io.airlift to 0.9

Fixes: scylladb/scylla-tools-java#374
2024-01-11 11:03:29 +02:00
Botond Dénes
abb7ae4309 Update ./tools/jmx submodule
* tools/jmx f21550e...50909d6 (1):
  > scylla-apiclient: drop hk2-locator dependency

Fixes: scylladb/scylla-jmx#231
2024-01-10 14:22:14 +02:00
Botond Dénes
2820c63734 Update tools/java submodule
* tools/java d7ec9bf45f...79fa02d8a3 (2):
  > build.xml: update scylla-driver-core to 3.11.5.1
  > treewide: update "guava" package

Fixes: scylla-tools-java#365
Fixes: scylla-tools-java#343

Closes #16693
2024-01-10 08:19:43 +02:00
Nadav Har'El
ac0056f4bc Merge 'Fix partition estimation with TWCS tables during streaming' from Raphael "Raph" Carvalho
TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows.

Turns out we had two problems in this area that leads to suboptimal bloom filters.

1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed.
2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count.

For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong.

Fixes https://github.com/scylladb/scylladb/issues/15704.

Closes scylladb/scylladb#15938

* github.com:scylladb/scylladb:
  streaming: Improve partition estimation with TWCS
  streaming: Don't adjust partition estimate if segregation is postponed

(cherry picked from commit 64d1d5cf62)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #16672
2024-01-08 09:06:43 +02:00
Calle Wilund
aaa25e1a78 Commitlog replayer: Range-check skip call
Fixes #15269

If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.

v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
    ensure we log the fact that things are broken

Closes scylladb/scylladb#15270

(cherry picked from commit 6ffb482bf3)
2024-01-05 09:19:45 +02:00
Beni Peled
c57a0a7a46 release: prepare for 5.2.13 2024-01-03 17:48:59 +02:00
Botond Dénes
740ba3ac2a tools/schema_loader: read_schema_table_mutation(): close the reader
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: #16519

Closes scylladb/scylladb#16521

(cherry picked from commit da033343b7)
2023-12-31 18:13:10 +02:00
Gleb Natapov
76c3dda640 storage_service: register schema version observer before joining group0 and starting gossiper
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.

Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.

Fixes: #15078

Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
(cherry picked from commit d1654ccdda)
2023-12-20 11:14:27 +01:00
Kamil Braun
287546923e Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak
Fixes #9405

`sync_point` API provided with incorrect sync point id might allocate
crazy amount of memory and fail with `std::bad_alloc`.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and compering it with
a checksum calculated in `db::hints::decode` before decoding.

Closes #14534

* github.com:scylladb/scylladb:
  db: hints: add checksum to sync point encoding
  db: hints: add the version_size constant

(cherry picked from commit eb6202ef9c)

The only difference from the original merge commit is the include
path of `xx_hasher.hh`. On branch 5.2, this file is in the root
directory, not `utils`.

Closes #16458
2023-12-19 17:39:50 +02:00
Botond Dénes
c0dab523f9 Update tools/java submodule
* tools/java e2aad6e3a0...d7ec9bf45f (1):
  > Merge "build: take care of old libthrift" from Piotr Grabowski

Fixes: scylladb/scylla-tools-java#352

Closes #16464
2023-12-19 17:37:27 +02:00
Michael Huang
5499f7b5a8 cdc: use chunked_vector for topology_description entries
Lists can grow very big. Let's use a chunked vector to prevent large contiguous
allocations.
Fixes: #15302.

Closes scylladb/scylladb#15428

(cherry picked from commit 62a8a31be7)
2023-12-19 13:43:23 +01:00
Piotr Grabowski
7055ac45d1 test: use more frequent reconnection policy
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.

This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.

Fix the issue by using a exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with reconnection exception stack trace.

Fixes #15104

Closes #15112

(cherry picked from commit 17e3e367ca)
2023-12-19 13:43:23 +01:00
Gleb Natapov
4ff29d1637 raft: drop assert in server_impl::apply_snapshot for a condition that may happen
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.

Drop the assert that verifies the assumption and return an error in this
case instead.

Fixes: #15222

Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
(cherry picked from commit 55f047f33f)
2023-12-19 13:43:23 +01:00
Alexey Novikov
6bcf9e6631 When add duration field to UDT check whether this UDT is used in some clustering key
Having values of the duration type is not allowed for clustering
columns, because duration can't be ordered. This is correctly validated
when creating a table but do not validated when we alter the type.

Fixes #12913

Closes scylladb/scylladb#16022

(cherry picked from commit bd73536b33)
2023-12-19 06:58:41 -05:00
Takuya ASADA
74dd8f08e3 dist: fix local-fs.target dependency
systemd man page says:

systemd-fstab-generator(3) automatically adds dependencies of type Before= to
all mount units that refer to local mount points for this target unit.

So "Before=local-fs.taget" is the correct dependency for local mount
points, but we currently specify "After=local-fs.target", it should be
fixed.

Also replaced "WantedBy=multi-user.target" with "WantedBy=local-fs.target",
since .mount are not related with multi-user but depends local
filesystems.

Fixes #8761

Closes scylladb/scylladb#15647

(cherry picked from commit a23278308f)
2023-12-19 13:15:00 +02:00
Botond Dénes
68507ed4d9 Merge '[Backport 5.2] Shard of shard repair task impl' from Aleksandra Martyniuk
Shard id is logged twice in repair (once explicitly, once added by logger).
Redundant occurrence is deleted.

shard_repair_task_impl::id (which contains global repair shard)
is renamed to avoid further confusion.

Fixes: https://github.com/scylladb/scylladb/issues/12955

Closes #16439

* github.com:scylladb/scylladb:
  repair: rename shard_repair_task_impl::id
  repair: delete redundant shard id from logs
2023-12-19 10:28:57 +02:00
Botond Dénes
46a29e9a02 Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.

The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.

Fixes #6389.

Closes #13150

* github.com:scylladb/scylladb:
  test/alternator: test concurrent TagResource / UntagResource
  db/tags: drop unsafe update_tags() utility function
  alternator: isolate concurrent modification to tags
  db/tags: add safe modify_tags() utility functions
  migration_manager: expose access to storage_proxy

(cherry picked from commit dba1d36aa6)

Closes #16453
2023-12-19 10:19:31 +02:00
Botond Dénes
23fd6939eb Merge '[Backport to 5.2] gossiper: mark_alive: use deferred_action to unmark pending' from Benny Halevy
Backport the following patches to 5.2:
- gossiper: mark_alive: enter background_msg gate (#14791)
- gossiper: mark_alive: use deferred_action to unmark pending (#14839)

Closes #16452

* github.com:scylladb/scylladb:
  gossiper: mark_alive: use deferred_action to unmark pending
  gossiper: mark_alive: enter background_msg gate
2023-12-19 09:06:37 +02:00
Botond Dénes
1cf499cfea Update tools/java submodule
* tools/java 80701efa8d...e2aad6e3a0 (2):
  > build: update logback dependency
  > build: update `netty` dependency

Fixes: https://github.com/scylladb/scylla-tools-java/issues/363
Fixes: https://github.com/scylladb/scylla-tools-java/issues/364

Closes #16444
2023-12-18 18:19:20 +02:00
Nadav Har'El
91e05dc646 cql: fix SELECT toJson() or SELECT JSON of time column
The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column
of type "time" forgot to put the time string in quotes. The result was
invalid JSON. This is patch is a one-liner fixing this bug.

This patch also removes the "xfail" marker from one xfailing test
for this issue which now starts to pass. We also add a second test for
this issue - the existing test was for "SELECT TOJSON(t)", and the second
test shows that "SELECT JSON t" had exactly the same bug - and both are
fixed by the same patch.

We also had a test translated from Cassandra which exposed this bug,
but that test continues to fail because of other bugs, so we just
need to update the xfail string.

The patch also fixes one C++ test, test/boost/json_cql_query_test.cc,
which enshrined the *wrong* behavior - JSON output that isn't even
valid JSON - and had to be fixed. Unlike the Python tests, the C++ test
can't be run against Cassandra, and doesn't even run a JSON parser
on the output, which explains how it came to enshrine wrong output
instead of helping to discover the bug.

Fixes #7988

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16121

(cherry picked from commit 8d040325ab)
2023-12-18 18:19:20 +02:00
Benny Halevy
a2009c4a8c gossiper: mark_alive: use deferred_action to unmark pending
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.

Fixes #14839

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14840

(cherry picked from commit 1e7e2eeaee)
2023-12-18 14:44:22 +02:00
Benny Halevy
999a6bfaae gossiper: mark_alive: enter background_msg gate
The function dispatch a background operation that must be
waited on in stop().

\Fixes scylladb/scylladb#14791

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 868e436901)
2023-12-18 14:42:52 +02:00
Kefu Chai
faef786c88 reloc: strip.sh: always generate symbol list with posix format
we compare the symbols lists of stripped ELF file ($orig.stripped) and
that of the one including debugging symbols ($orig.debug) to get a
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).

but we generate the symbol list of stripped ELF file using the
sysv format, while generate the one from the unstripped one using
posix format. the former is always padded the symbol names with spaces
so that their the length at least the same as the section name after
we split the fields with "|".

that's why the diff includes the stuff we don't expect. and hence,
we have tons of warnings like:

```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```

when using objcopy to filter the ELF file to keep only the
symbols we are interested in.

so, in this change

* use the same format when dumping the symbols from unstripped ELF
  file
* include the symbols in the text area -- the code, by checking
  "T" and "t" in the dumped symbols. this was achieved by matching
  the lines with "FUNC" before this change.
* include the the symbols in .init data section -- the global
  variables which are initialized at compile time. they could
  be also interesting when debugging an application.

Fixes #15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15514

(cherry picked from commit 50c937439b)
2023-12-18 13:58:14 +02:00
Michał Chojnowski
7e9bdef8bb row_cache: when the constructor fails, clear _partitions in the right allocator
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.

Fix that.

Fixes #15632

Closes scylladb/scylladb#15633

(cherry picked from commit 330d221deb)
2023-12-18 13:55:16 +02:00
Michael Huang
af38b255c8 cql3: Fix invalid JSON parsing for JSON objects with ASCII keys
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.

Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.

Fixes: #7949

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Closes scylladb/scylladb#15499

(cherry picked from commit 75109e9519)
2023-12-18 13:45:57 +02:00
Kefu Chai
c4b699525a sstables: throw at seeing invalid chunk_len
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is throw in place
of an `assert()`, which is supposed to verify the programming
invariants, not for identifying corrupted data file.

Fixes #15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15264

(cherry picked from commit 1ed894170c)
2023-12-18 13:29:02 +02:00
Nadav Har'El
3a24b8c435 sstables: stop warning when auto-snapshot leaves non-empty directory
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:

```
WARN  2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```

It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.

And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.

To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.

Fixes #13538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14557

(cherry picked from commit edfb89ef65)
2023-12-18 13:26:40 +02:00
Kefu Chai
9e9a488da3 streaming: cast the progress to a float before formatting it
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.

so, let's make the percentage a float before formatting it.

Fixes #14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14588

(cherry picked from commit 1eb76d93b7)
2023-12-18 13:16:58 +02:00
Aleksandra Martyniuk
614d15b9f6 repair: rename shard_repair_task_impl::id
shard_repair_task_impl::id stores global repair id. To avoid confusion
with the task id, the field is renamed to global_repair_id.

(cherry picked from commit d889a599e8)
2023-12-18 12:08:00 +01:00
Aleksandra Martyniuk
fc2799096f repair: delete redundant shard id from logs
In repair shard id is logged twice. Delete repeated occurence.

(cherry picked from commit f7c88edec5)
2023-12-18 12:03:26 +01:00
Petr Gusev
b9178bd853 hints: send_one_hint: extend the scope of file_send_gate holder
The problem was that the holder in with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending of some hints led to errors.

Fixes #15110

(cherry picked from commit 9fd3df13a2)
2023-12-18 13:03:23 +02:00
Kefu Chai
12aacea997 compound_compat: do not format an sstring with {:d}
before this change, we format a sstring with "{:d}", fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.

so, in this change, we just print the int8_t using `seastar::format()`
in a single pass. and with the format specifier of `#02x` instead of
adding the "0x" prefix manually.

Fixes #14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14578

(cherry picked from commit 27d6ff36df)
2023-12-18 12:47:16 +02:00
Kefu Chai
df30f66bfa tools/scylla-sstable: dump column_desc as an object
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:

```
        "regular_columns": [
          "name",
          "clustering_order",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "column_name_bytes",
          "type_name",
          "org.apache.cassandra.db.marshal.BytesType",
          "name",
          "kind",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type",
          "name",
          "position",
          "type_name",
          "org.apache.cassandra.db.marshal.Int32Type",
          "name",
          "type",
          "type_name",
          "org.apache.cassandra.db.marshal.UTF8Type"
        ]
```

but according
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,

> $SERIALIZATION_HEADER_METADATA := {
>     "min_timestamp_base": Uint64,
>     "min_local_deletion_time_base": Uint64,
>     "min_ttl_base": Uint64",
>     "pk_type_name": String,
>     "clustering_key_types_names": [String, ...],
>     "static_columns": [$COLUMN_DESC, ...],
>     "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
>     "name": String,
>     "type_name": String
> }

"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columnes". this schema makes sense,
as each column should be considered as a single object which
is composed of two properties. but we dump them like a list.

so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.

after the change, "regular_columns" are printed like:
```
        "regular_columns": [
          {
            "name": "clustering_order",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "column_name_bytes",
            "type_name": "org.apache.cassandra.db.marshal.BytesType"
          },
          {
            "name": "kind",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          },
          {
            "name": "position",
            "type_name": "org.apache.cassandra.db.marshal.Int32Type"
          },
          {
            "name": "type",
            "type_name": "org.apache.cassandra.db.marshal.UTF8Type"
          }
        ]
```

Fixes #15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15037

(cherry picked from commit c82f1d2f57)
2023-12-18 12:26:36 +02:00
Michał Sala
2427bda737 forward_service: introduce shutdown checks
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.

The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.

This fix should resolve the issue where forward_service retries could
block the shutdown.

Fixes #12604

Closes #13922

(cherry picked from commit e0855b1de2)
2023-12-18 12:25:25 +02:00
Petr Gusev
27adf340ef storage_proxy: mutation:: make frozen_mutation [[ref]]
We had a redundant copy in receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrary large.

Fixes: #12504
(cherry picked from commit 5adbb6cde2)
2023-12-18 12:20:40 +02:00
Botond Dénes
5c33c9d6a6 Merge 'thrift: return address in listen_addresses() only after server is ready' from Marcin Maliszkiewicz
This is used for readiness API: /storage_service/rpc_server and the fix prevents from returning 'true' prematurely.

Some improvement for readiness was added in a51529dd15 but thrift implementation wasn't fully done.

Fixes https://github.com/scylladb/scylladb/issues/12376

Closes #13319

* github.com:scylladb/scylladb:
  thrift: return address in listen_addresses() only after server is ready
  thrift: simplify do_start_server() with seastar:async

(cherry picked from commit 9a024f72c4)
2023-12-18 12:20:40 +02:00
Kamil Braun
9aaaa66981 Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.

Fixes #13657

Closes #13658

* github.com:scylladb/scylladb:
  cql3: fix printing of column_specification::name in some error messages
  cql3: fix printing of column_definition::name in some error messages

(cherry picked from commit a29b8cd02b)
2023-12-18 09:55:37 +02:00
Avi Kivity
b21ec82894 Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.

get_endpoints is used here by gossiper and storage_service for iterations that may preempt
instead of iterating direction over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`) so to prevent use-after-free that may potentially happen if the map is rehashed while the function yields causing invalidation of the loop iterators.

\Fixes #13899

\Closes #13900

* github.com:scylladb/scylladb:
  storage_service: do not preempt while traversing endpoint_state_map
  gossiper: do not preempt while traversing endpoint_state_map

(cherry picked from commit d2d53fc1db)

Closes #16431
2023-12-18 09:35:42 +02:00
Yaron Kaikov
5052890ae8 release: prepare for 5.2.12 2023-12-17 14:28:03 +02:00
Kefu Chai
0da3453f95 db: schema_tables: capture reference to temporary value by value
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.

this also silences warning from GCC-13:

```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                              ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
 3654 |     const column_definition& first_view_ck = v->clustering_key_columns().front();
      |                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```

Fixes #13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13721

(cherry picked from commit 135b4fd434)
2023-12-15 13:55:57 +02:00
Benny Halevy
6d7b2bc02f sstables: compressed_file_data_source_impl: get: throw malformed_sstable_exception on premature eof
Currently, the reader might dereference a null pointer
if the input stream reaches eof prematurely,
and read_exactly returns an empty temporary_buffer.

Detect this condition before dereferencing the buffer
and sstables::malformed_sstable_exception.

Fixes #13599

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13600

(cherry picked from commit 77b70dbdb7)
2023-12-15 13:54:42 +02:00
Wojciech Mitros
119c8279dd rust: update wasmtime dependency
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.

We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.

Fixes #13807

Closes #13804

(cherry picked from commit 6bc16047ba)
2023-12-15 13:54:42 +02:00
Michał Chojnowski
3af6dfe4ac database: fix reads_memory_consumption for system semaphore
The metric shows the opposite of what its name suggests.
It shows available memory rather than consumed memory.
Fix that.

Fixes #13810

Closes #13811

(cherry picked from commit 0813fa1da0)
2023-12-15 13:54:42 +02:00
Eliran Sinvani
0230798db3 use_statement: Covert an exception to a future exception
The use statement execution code can throw if the keyspace is
doesn't exist, this can be a problem for code that will use
execute in a fiber since the exception will break the fiber even
if `then_wrapped` is used.

Fixes #14449

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes scylladb/scylladb#14394

(cherry picked from commit c5956957f3)
2023-12-15 13:54:42 +02:00
Botond Dénes
64503a7137 Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski
reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.

This series fixes the bug and adds a minimal reproducer test.

Fixes #10598

Closes scylladb/scylladb#16003

* github.com:scylladb/scylladb:
  mutation_query_test: test that range tombstones are sent in reverse queries
  mutation_query: properly send range tombstones in reverse queries

(cherry picked from commit 65e42e4166)
2023-12-14 12:53:07 +02:00
Yaron Kaikov
b013877629 build_docker.sh: Upgrade package during creation and remove sshd service
When scanning our latest docker image using `trivy` (command: `trivy
image docker.io/scylladb/scylla-nightly:latest`), it shows we have OS
packages which are out of date.

Also removing `openssh-server` and `openssh-client` since we don't use
it for our docker images

Fixes: https://github.com/scylladb/scylladb/issues/16222

Closes scylladb/scylladb#16224

(cherry picked from commit 7ce6962141)

Closes #16360
2023-12-11 10:57:16 +02:00
Botond Dénes
33d2da94ab reader_concurrency_semaphore: execution_loop(): trigger admission check when _ready_list is empty
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.

Fixes: #13540

Closes #13541

(cherry picked from commit b790f14456)
2023-12-07 16:04:55 +02:00
Paweł Zakrzewski
dac69be4a4 auth: fix error message when consistency level is not met
Propagate `exceptions::unavailable_exception` error message to the
client such as cqlsh.

Fixes #2339

(cherry picked from commit 400aa2e932)
2023-12-07 14:49:47 +02:00
Botond Dénes
763e583cf2 Merge 'row_cache: abort on exteral_updater::execute errors' from Benny Halevy
Currently the cache updaters aren't exception safe
yet they are intended to be.

Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.

Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.

\Fixes scylladb/scylladb#15576

\Closes scylladb/scylladb#15577

* github.com:scylladb/scylladb:
  row_cache: abort on exteral_updater::execute errors
  row_cache: do_update: simplify _prev_snapshot_pos setup

(cherry picked from commit 4a0f16474f)

Closes scylladb/scylladb#16256
2023-12-07 09:16:42 +02:00
Nadav Har'El
b331b4a4bb Backport fixes for nodetool commands with Alternator GSI in the database
Fixes #16153

* java e716e1bd1d...80701efa8d (1):
  > NodeProbe: allow addressing table name with colon in it

/home/nyh/scylla/tools$ git submodule summary jmx | cat
* jmx bc4f8ea...f21550e (3):
  > ColumnFamilyStore: only quote table names if necessary
  > APIBuilder: allow quoted scope names
  > ColumnFamilyStore: don't fail if there is a table with ":" in its name

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #16296
2023-12-06 10:48:49 +02:00
Anna Stuchlik
d9448a298f doc: fix rollback in the 4.6-to-5.0 upgrade guide
This commit fixes the rollback procedure in
the 4.6-to-5.0 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4, branch-5.2, and branch-5.1

Closes scylladb/scylladb#16155

(cherry picked from commit 1e80bdb440)
2023-12-05 15:10:21 +02:00
Anna Stuchlik
a82fd96b6a doc: fix rollback in the 5.0-to-5.1 upgrade guide
This commit fixes the rollback procedure in
the 5.0-to-5.1 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4, branch-5.2, and branch-5.1

Closes scylladb/scylladb#16154

(cherry picked from commit 7ad0b92559)
2023-12-05 15:08:25 +02:00
Anna Stuchlik
ae79fb9ce0 doc: fix rollback in the 5.1-to-5.2 upgrade guide
This commit fixes the rollback procedure in
the 5.1-to-5.2 upgrade guide:
- The "Restore system tables" step is removed.
- The "Restore the configuration file" command
  is fixed.
- The "Gracefully shutdown ScyllaDB" command
  is fixed.

In addition, there are the following updates
to be in sync with the tests:

- The "Backup the configuration file" step is
  extended to include a command to backup
  the packages.
- The Rollback procedure is extended to restore
  the backup packages.
- The Reinstallation section is fixed for RHEL.

Also, I've the section removed the rollback
section for images, as it's not correct or
relevant.

Refs https://github.com/scylladb/scylladb/issues/11907

This commit must be backported to branch-5.4 and branch-5.2.

Closes scylladb/scylladb#16152

(cherry picked from commit 91cddb606f)
2023-12-05 14:58:21 +02:00
Pavel Emelyanov
d83f4b9240 Update seastar submodule
* seastar eda297fc...43a1ce58 (2):
  > io_queue: Add iogroup label to metrics
  > io_queue: Remove ioshard metrics label

refs: scylladb/seastar#1591

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-12-05 10:46:07 +03:00
Raphael S. Carvalho
1b8c078cab test: Fix sporadic failures of database_test
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.

The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.

That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth is timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.

Fixes #13660.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13851

(cherry picked from commit a7ceb987f5)
2023-11-30 17:31:07 +02:00
Benny Halevy
1592a84b80 task_manager: module: make_task: enter gate when the task is created
Passing the gate_closed_exception to the task promise in start()
ends up with abandoned exception since no-one is waiting
for it.

Instead, enter the gate when the task is made
so it will fail make_task if the gate is already closed.

Fixes scylladb/scylladb#15211

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f9a7635390)
2023-11-30 17:16:57 +02:00
Michał Chojnowski
bfeadae1bd position_in_partition: make operator= exception-safe
The copy assignment operator of _ck can throw
after _type and _bound_weight have already been changed.
This leaves position_in_partition in an inconsistent state,
potentially leading to various weird symptoms.

The problem was witnessed by test_exception_safety_of_reads.
Specifically: in cache_flat_mutation_reader::add_to_buffer,
which requires the assignment to _lower_bound to be exception-safe.

The easy fix is to perform the only potentially-throwing step first.

Fixes #15822

Closes scylladb/scylladb#15864

(cherry picked from commit 93ea3d41d8)
2023-11-30 15:01:22 +02:00
Avi Kivity
2c219a65f8 Update seastar submodule (spins on epoll)
* seastar 45f4102428...eda297fcb5 (1):
  > epoll: Avoid spinning on aborted connections

Fixes #12774
Fixes #7753
Fixes #13337
2023-11-30 14:09:22 +02:00
Piotr Grabowski
7054b1ab1e install-dependencies.sh: update node_exporter to 1.7.0
Update node_exporter to 1.7.0.

The previous version (1.6.1) was flagged by security scanners (such as
Trivy) with HIGH-severity CVE-2023-39325. 1.7.0 release fixed that
problem.

[Botond: regenerate frozen toolchain]

Fixes #16085

Closes scylladb/scylladb#16086

Closes scylladb/scylladb#16090

(cherry picked from commit 321459ec51)

[avi: regenerate frozen toolchain]
2023-11-27 18:17:38 +00:00
Anna Mikhlin
a65838ee9c re-spin: 5.2.11 2023-11-26 16:17:58 +02:00
Botond Dénes
68faf18ad9 Update ./tools/jmx and ./tools/java submodules
* tools/jmx 88d9bdc...bc4f8ea (1):
  > Merge "scylla-apiclient: update several Java dependencies" from Piotr Grabowski

* tools/java f8f556d802...e716e1bd1d (1):
  > Merge 'build: update several dependencies' from Piotr Grabowski

Update build dependencies which were flagged by security scanners.

Refs: scylladb/scylla-jmx#220
Refs: scylladb/scylla-tools-java#351

Closes #16150
2023-11-23 15:29:00 +02:00
Beni Peled
44d1b55253 release: prepare for 5.2.11 2023-11-22 14:22:13 +02:00
Tomasz Grabiec
bfd8401477 api, storage_service: Recalculate table digests on relocal_schema api call
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.

Use like this:

  curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema

Fixes #15380

Closes #15381

(cherry picked from commit c27d212f4b)
2023-11-21 01:29:28 +01:00
Botond Dénes
e31f2224f5 migration_manager: also reload schema on enabling digest_insensitive_to_expiry
Currently, when said feature is enabled, we recalcuate the schema
digest. But this feature also influences how table versions are
calculated, so it has to trigger a recalculation of all table versions,
so that we can guarantee correct versions.
Before, this used to happen by happy accident. Another feature --
table_digest_insensitive_to_expiry -- used to take care of this, by
triggering a table version recalulation. However this feature only takes
effect if digest_insensitive_to_expiry is also enabled. This used to be
the case incidently, by the time the reload triggered by
table_digest_insensitive_to_expiry ran, digest_insensitive_to_expiry was
already enabled. But this was not guaranteed whatsoever and as we've
recently seen, any change to the feature list, which changes the order
in which features are enabled, can cause this intricate balance to
break.
This patch makes digest_insensitive_to_expiry also kick off a schema
reload, to eliminate our dependence on (unguaranteed) feature order, and
to guarantee that table schemas have a correct version after all features
are enabled. In fact, all schema feature notification handlers now kick
off a full schema reload, to ensure bugs like this don't creep in, in
the future.

Fixes: #16004

Closes scylladb/scylladb#16013

(cherry picked from commit 22381441b0)
2023-11-21 01:29:28 +01:00
Kamil Braun
4101c8beab schema_tables: remove default value for reload in merge_schema
To avoid bugs like the one fixed in the previous commit.

(cherry picked from commit 4376854473)
2023-11-21 01:29:28 +01:00
Kamil Braun
c994ed2057 schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.

(cherry picked from commit 48164e1d09)
2023-11-21 01:29:28 +01:00
Avi Kivity
40eed1f1c5 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables

(cherry picked from commit cf81eef370)
2023-11-21 01:29:28 +01:00
Gleb Natapov
f233c8a9e4 database: fix do_apply_many() to handle empty array of mutations
Currently the code will assert because cl pointer will be null and it
will be null because there is no mutations to initialize it from.
Message-Id: <20230212144837.2276080-3-gleb@scylladb.com>

(cherry picked from commit 941407b905)

Backport needed by #4485.
2023-11-21 01:29:17 +01:00
Botond Dénes
0f3e31975d api/storage_service: start/stop native transport in the statement sg
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.

Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.

I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.

Fixes: #15485

Closes scylladb/scylladb#16019

(cherry picked from commit dfd7981fa7)
2023-11-20 20:00:56 +02:00
Takuya ASADA
c98b22afce scylla_post_install.sh: detect RHEL correctly
$ID_LIKE = "rhel" works only on RHEL compatible OSes, not for RHEL
itself.
To detect RHEL correctly, we also need to check $ID = "rhel".

Fixes #16040

Closes scylladb/scylladb#16041

(cherry picked from commit 338a9492c9)
2023-11-20 19:36:22 +02:00
Marcin Maliszkiewicz
900754d377 db: view: run local materialized view mutations on a separate smp service group
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.

This fix affects also alternator's secondary indexes.

Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000

Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.

Fixes https://github.com/scylladb/scylladb/issues/15844

Closes scylladb/scylladb#15845

(cherry picked from commit 020a9c931b)
2023-11-19 18:54:46 +02:00
Botond Dénes
fbb356aa88 repair/repair.cc: do_repair_ranges(): prevent stalls when skipping ranges
We have observed do_repair_ranges() receiving tens of thousands of
ranges to repairs on occasion. do_repair_ranges() repairs all ranges in
parallel, with parallel_for_each(). This is normally fine, as the lambda
inside parallel_for_each() takes a semaphore and this will result in
limited concurrency.
However, in some instances, it is possible that most of these ranges are
skipped. In this case the lambda will become synchronous, only logging a
message. This can cause stalls beacuse there are no opportunities to
yield. Solve this by adding an explicit yield to prevent this.

Fixes: #14330

Closes scylladb/scylladb#15879

(cherry picked from commit 90a8489809)
2023-11-08 21:10:30 +02:00
Michał Jadwiszczak
e8871c02a1 cql3:statements:describe_statement: check pointer to UDF/UDA
While looking for specific UDF/UDA, result of
`functions::functions::find()` needs to be filtered out based on
function's type.

Fixes: #14360
(cherry picked from commit d498451cdf)
2023-11-08 20:16:41 +02:00
Pavel Emelyanov
f76ba217e7 Merge 'api: failure_detector: invoke on shard 0' from Kamil Braun
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be easily and
without risk be backported to older versions; this is the fix.

Add a simple test to check that the `failure_detector/endpoints`
API returns nonzero generation.

Fixes: scylladb/scylladb#15816

Closes scylladb/scylladb#15970

* github.com:scylladb/scylladb:
  test: rest_api: test that generation is nonzero in `failure_detector/endpoints`
  api: failure_detector: fix indentation
  api: failure_detector: invoke on shard 0

(cherry picked from commit 9443253f3d)
2023-11-07 15:12:12 +01:00
Botond Dénes
17e4d535db test/cql-pytest/nodetool.py: no_autocompaction_context: use the correct API
This `with` context is supposed to disable, then re-enable
autocompaction for the given keyspaces, but it used the wrong API for
it, it used the column_family/autocompaction API, which operates on
column families, not keyspaces. This oversight led to a silent failure
because the code didn't check the result of the request.
Both are fixed in this patch:
* switch to use `storage_service/auto_compaction/{keyspace}` endpoint
* check the result of the API calls and report errors as exceptions

Fixes: #13553

Closes #13568

(cherry picked from commit 66ee73641e)
2023-11-07 13:59:01 +02:00
Aleksandra Martyniuk
75b792e260 repair: release resources of shard_repair_task_impl
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.

In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.

Release shard_repair_task_impl resources immediately after shard
repair is finished.

Fixes: #15505.
(cherry picked from commit 0474e150a9)

Closes #15875
2023-11-07 09:40:05 +02:00
Tomasz Grabiec
573ef87245 Merge ' tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes
scylla-sstable currently has two ways to obtain the schema:

    * via a `schema.cql` file.
    * load schema definition from memory (only works for system tables).

This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file.
This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override.
If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong.
A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes.

This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change.

Example:

```
$ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db
{"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}}
```

As seen above, subdirectories like qurantine, staging etc are also supported.

Fixes: https://github.com/scylladb/scylladb/issues/10126

Closes #13448

* github.com:scylladb/scylladb:
  test/cql-pytest: test_tools.py: add tests for schema loading
  test/cql-pytest: add no_autocompaction_context
  docs: scylla-sstable.rst: remove accidentally added copy-pasta
  docs: scylla-sstable.rst: remove paragraph with schema limitations
  docs: scylla-sstable.rst: update schema section
  test/cql-pytest: nodetool.py: add flush_keyspace()
  tools/scylla-sstable: reform schema loading mechanism
  tools/schema_loader: add load_schema_from_schema_tables()
  db/schema_tables: expose types schema

(cherry picked from commit 952b455310)

Closes #15386
2023-11-02 17:25:18 +02:00
Beni Peled
454e5a7110 release: prepare for 5.2.10 2023-11-02 15:08:11 +00:00
Avi Kivity
9967c0bda4 Update tools/pythion3 submodule (tar file timestamps)
* tools/python3 cf7030a...6ad2e5a (1):
  > create-relocatable-package.py: fix timestamp of executable files

Fixes #13415.
2023-11-02 12:37:09 +01:00
Botond Dénes
48509c5c00 Merge '[Backport 5.2] properly update storage service after schema changes' from Benny Halevy
This is a backport of https://github.com/scylladb/scylladb/pull/14158 to branch 5.2

Closes #15872

* github.com:scylladb/scylladb:
  migration_notifier: get schema_ptr by value
  migration_manager: propagate listener notification exceptions
  storage_service: keyspace_changed: execute only on shard 0
  database: modify_keyspace_on_all_shards: execute func first on shard 0
  database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
  database: add modify_keyspace_on_all_shards
  schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
  database: create_keyspace_on_all_shards
  database: update_keyspace_on_all_shards
  database: drop_keyspace_on_all_shards
2023-10-31 10:27:08 +02:00
Botond Dénes
d606e9bfa2 Merge '[branch-5.2] Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.

Fixes https://github.com/scylladb/scylladb/issues/14992.

Backport notes: relatively easy to backport, had to include
**replica: Make compaction_group responsible for deleting off-strategy compaction input**
and
**compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0**

Closes #15793

* github.com:scylladb/scylladb:
  test: Verify that off-strategy can do incremental compaction
  compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
  compaction: Clear pending_replacement list when tombstone GC is disabled
  compaction: Enable incremental compaction on off-strategy
  compaction: Extend reshape type to allow for incremental compaction
  compaction: Move reshape_compaction in the source
  compaction: Enable incremental compaction only if replacer callback is engaged
  replica: Make compaction_group responsible for deleting off-strategy compaction input
2023-10-30 12:00:54 +02:00
Benny Halevy
cd7abb3833 migration_notifier: get schema_ptr by value
To prevent use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097
where a temp schema_ptr retrieved from a global_schema_ptr
get destroyed when the notification function yielded.

Capturing the schema_ptr on the coroutine frame
is inexpensive since its a shared ptr and it makes sure
that the schema remains valid throughput the coroutine
life time.

\Fixes scylladb/scylladb#15097

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

\Closes #15098

(cherry picked from commit 0f54e24519)
2023-10-29 19:39:17 +02:00
Benny Halevy
8064fface9 migration_manager: propagate listener notification exceptions
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards.  Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.

We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```

The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).

Refs #3389

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 825d617a53)
2023-10-29 19:32:55 +02:00
Benny Halevy
0cf6891c6d storage_service: keyspace_changed: execute only on shard 0
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.

Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a690f0e81f)
2023-10-29 19:27:52 +02:00
Benny Halevy
16a594d564 database: modify_keyspace_on_all_shards: execute func first on shard 0
When creating or altering a keyspace, we create a new
effective_replication_map instance.

It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate to new e_r_m (and reach
the same result).  When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 13dd92e618)
2023-10-29 19:22:01 +02:00
Benny Halevy
096c312821 database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.

\Fixes #13137

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ba15786059)
2023-10-29 19:21:34 +02:00
Benny Halevy
5c27dacad5 database: add modify_keyspace_on_all_shards
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3b8c913e61)
2023-10-29 19:16:56 +02:00
Benny Halevy
14113dc23e schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit dc9b0812e9)
2023-10-29 19:14:06 +02:00
Benny Halevy
4d5a99f3b8 database: create_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3520c786bd)
2023-10-29 19:13:55 +02:00
Benny Halevy
ffe28b3e3f database: update_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 53a6ea8616)
2023-10-29 19:06:45 +02:00
Benny Halevy
1459306603 database: drop_keyspace_on_all_shards
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9d40305ef6)
2023-10-29 19:06:25 +02:00
Kefu Chai
30a4eb0ea7 sstables: writer: delegate flush() in checksummed_file_data_sink_impl
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.

this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after application closes the corresponding
fd. to be more specific, almost none of the popular local filesystem
implements the file_operations.op, hence, it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or a storage connected via
a buffered network device, then it is crucial to flush in a timely
manner, otherwise we could risk data lost if the application / machine /
network breaks when the data is considerered persisted but they are
_not_!

but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.

the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.

in this change, we just delegate the `flush()` call to the
wrapped class.

Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #15134

(cherry picked from commit d2d1141188)
2023-10-26 16:48:17 +03:00
Avi Kivity
ea198d884d cql3: grammar: reject intValue with no contents
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).

Easily fixed by removing the empty alternative. A unit test is
added.

Fixes #14705.

Closes #14707

(cherry picked from commit e00811caac)
2023-10-25 19:15:28 +03:00
Raphael S. Carvalho
b8c8794e14 test: Verify that off-strategy can do incremental compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Benny Halevy
0c2bb5f0b3 compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
Prevent div-by-zero byt returning const level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test, before it's
extended to cover also offstrategy compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b1e164a241)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Raphael S. Carvalho
61316d8e88 compaction: Clear pending_replacement list when tombstone GC is disabled
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.

Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It's not a problem until now
where we want reshape to do incremental compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:33 -03:00
Raphael S. Carvalho
b8e2739596 compaction: Enable incremental compaction on off-strategy
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.

Input sstables in off-strategy are very likely to mostly disjoint,
so it can greatly benefit from incremental compaction.

The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.

Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.

Fixes #14992.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 42050f13a0)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 17:05:29 -03:00
Raphael S. Carvalho
ba87dfefd1 compaction: Extend reshape type to allow for incremental compaction
That's done by inheriting regular_compaction, which implement
incremental compaction. But reshape still implements its own
methods for creating writer and reader. One reason is that
reshape is not driven by controller, as input sstables to it
live in maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit db9ce9f35a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
6b8499f4d8 compaction: Move reshape_compaction in the source
That's in preparation to next change that will make reshape
inherit from regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
8c8a80a03d compaction: Enable incremental compaction only if replacer callback is engaged
That's needed for enabling incremental compaction to operate, and
needed for subsequent work that enables incremental compaction
for off-strategy, which in turn uses reshape compaction type.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:07 -03:00
Raphael S. Carvalho
fdec5e62d0 replica: Make compaction_group responsible for deleting off-strategy compaction input
Compaction group is responsible for deleting SSTables of "in-strategy"
compactions, i.e. regular, major, cleanup, etc.

Both in-strategy and off-strategy compaction have their completion
handled using the same compaction group interface, which is
compaction_group::table_state::on_compaction_completion(...,
				sstables::offstrategy offstrategy)

So it's important to bring symmetry there, by moving the responsibility
of deleting off-strategy input, from manager to group.

Another important advantage is that off-strategy deletion is now throttled
and gated, allowing for better control, e.g. table waiting for deletion
on shutdown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13432

(cherry picked from commit 457c772c9c)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-10-22 15:00:06 -03:00
Raphael S. Carvalho
6798f9676f Resurrect optimization to avoid bloom filter checks during compaction
Commit 8c4b5e4 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.

Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.

This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.

The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.

Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
    45.64%    45.64%  sstable_compact  sstable_compaction_test_g
        [.] utils::filter::bloom_filter::is_present

With its resurrection, the problem is gone.

This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.

Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])

[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.

After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).

With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.

Cherry picked from commit 38b226f997

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes #14091.

Closes #13908

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15744
2023-10-20 09:34:53 +03:00
Botond Dénes
2642f32c38 Merge '[5.2 backport] doc: remove recommended image upgrade with OS from previous releases' from Anna Stuchlik
This is a backport of PR  https://github.com/scylladb/scylladb/pull/15740.

This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).

The scope of this commit:

- Remove the information from the 5.0-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 4.6-to.-5.1 upgrade guide and replace with general info.
- Remove the information from the 5.x.y-to.-5.x.z upgrade guide and replace with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides.
    /upgrade/_common/upgrade-image-opensource.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
    /upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst

Closes #15768

* github.com:scylladb/scylladb:
  doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
  doc: remove wrong image upgrade info (4.6-to-5.0)
  doc: remove wrong image upgrade info (5.0-to-5.1)
2023-10-19 12:30:55 +03:00
Anna Stuchlik
fcbcf1eafd doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.y upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides.

/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst

(cherry picked from commit dd1207cabb)
2023-10-19 08:47:25 +02:00
Anna Stuchlik
3a14fd31d0 doc: remove wrong image upgrade info (4.6-to-5.0)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

(cherry picked from commit 526d543b95)
2023-10-19 08:41:24 +02:00
Anna Stuchlik
c7b6152a81 doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

(cherry picked from commit 9852130c5b)
2023-10-19 08:40:27 +02:00
Asias He
ac45d8d092 repair: Use the updated estimated_partitions to create writer
The estimated_partitions is estimated after the repair_meta is created.

Currently, the default estimated_partitions was used to create the
write which is not correct.

To fix, use the updated estimated_partitions.

Reported by Petr Gusev

Closes #14179
Fixes #15748

(cherry picked from commit 4592bbe182)
2023-10-18 13:58:28 +03:00
Anna Stuchlik
d319c2a83f doc: remove recommended image upgrade with OS
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).

The scope of this commit:
- Remove the information from the 5.1-to.-5.2
  upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
  page.
- Remove outdated info (about previous releases)
  from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
  in the page tree.

Refs: https://github.com/scylladb/scylladb/issues/15733
(cherry picked from commit f6767f6d6e)

Closes #15754
2023-10-18 13:57:08 +03:00
Nadav Har'El
cb7e7f15ac Cherry-pick Seastar patch
Backported Seastar commit 4f4e84bb2cec5f11b4742396da7fc40dbb3f162f:

  > sstring: refactor to_sstring() using fmt::format_to()

Refs https://github.com/scylladb/scylladb/issues/15127

Closes #15663
2023-10-09 10:02:21 +03:00
Raphael S. Carvalho
00d431bd20 reader_concurrency_semaphore: Fix stop() in face of evictable reads becoming inactive
Scylla can crash due to a complicated interaction of service level drop,
evictable readers, inactive read registration path.

1) service level drop invoke stop of reader concurrency semaphore, which will
wait for in flight requests

2) turns out it stops first the gate used for closing readers that will
become inactive.

3) proceeds to wait for in-flight reads by closing the reader permit gate.

4) one of evictable reads take the inactive read registration path, and
finds the gate for closing readers closed.

5) flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.

By closing permit gate first, evictable readers becoming inactive will
be able to properly close underlying reader, therefore avoiding the
crash.

Fixes #15534.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15535

(cherry picked from commit 914cbc11cf)
2023-09-29 09:24:37 +03:00
Botond Dénes
ca8723a6fd Merge 'gossiper: add get_unreachable_members_synchronized and use over api' from Benny Halevy
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.

Fixes #12261
Fixes #15088

Also, add rest_api unit test for those apis

Closes #15093

* github.com:scylladb/scylladb:
  test: rest_api: add test_gossiper
  gossiper: add get_unreachable_members_synchronized

(cherry picked from commit 57deeb5d39)

Backport note: `gossiper::lock_endpoint_update_semaphore` helper
function was missing, replaced with
`get_units(g._endpoint_update_semaphore, 1)`
2023-09-27 15:09:32 +02:00
Beni Peled
5709d00439 release: prepare for 5.2.9 2023-09-20 12:34:43 +03:00
Konstantin Osipov
7202634789 raft: do not update raft address map with obsolete gossip data
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.

A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.

Fixes #14257
Refs #14357

Closes #14329

(cherry picked from commit b9c2b326bc)

Backport note: replaced `gms::generation_type` with `int64_t` because
the branch is missing the refactor which introduced `generation_type`
(7f04d8231d)
2023-09-19 11:12:38 +02:00
Avi Kivity
34e0afb18a Merge "auth: do not grant permissions to creator without actually creating" from Wojciech Mitros
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:

The query has a IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.

Additionally, a test is added that reproduces both of the mentioned
cases.

CVE-2023-33972

Fixes #15467.

* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
  auth: do not grant permissions to creator without actually creating
  transport: add is_schema_change() method to result_message

(cherry picked from commit ab6988c52f)
2023-09-19 01:47:27 +03:00
Anna Stuchlik
99e906499d doc: fix internal links
Fixes https://github.com/scylladb/scylladb/issues/14490

This commit fixes mulitple links that were broken
after the documentation is published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.

Closes #14664

(cherry picked from commit a93fd2b162)
2023-09-18 09:32:12 +03:00
Anna Stuchlik
b8ff392e8b doc: add info - support for FIPS-compliant systems
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.

This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.

Closes #15415

(cherry picked from commit fb635dccaa)
2023-09-18 09:17:59 +03:00
Raphael S. Carvalho
a65e5120ab compaction: base compaction throughput on amount of data read
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.

A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.

Fixes #14533.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14615

(cherry picked from commit 3b1829f0d8)
2023-09-14 21:30:22 +03:00
Jan Ciolek
cd9458eeb1 cql.g: make the parser reject INSERT JSON without a JSON value
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```

When no JSON value is specified, the query should be rejected.

Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.

A unit test is added to prevent regressions.

Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

\Closes #14785

(cherry picked from commit cbc97b41d4)
2023-09-14 21:07:21 +03:00
Nadav Har'El
e917b874f9 test/alternator: fix flaky test test_ttl_expiration_gsi_lsi
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look or the item after it was already expired!

So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.

This patch also fixes a small bug which prevented this test from running
on DynamoDB.

Fixes #14495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14496

(cherry picked from commit 599636b307)
2023-09-14 20:43:51 +03:00
Pavel Emelyanov
a27c391cba Update seastar submodule
* seastar 85147cfd...872e0bc6 (3):
  > rpc: Abort server connection streams on stop
  > rpc: Do not register stream to dying parent
  > rpc: Fix client-side stream registration race

refs: #13100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 12:33:30 +03:00
Beni Peled
455ab99b6c release: prepare for 5.2.8 2023-08-31 22:02:06 +03:00
Michał Chojnowski
adcf296bcf reader_concurrency_semaphore: fix a deadlock between stop() and execution_loop()
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.

Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.

A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.

After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.

Fixes #15198

Closes #15199

(cherry picked from commit 2000a09859)
2023-08-31 08:13:09 +03:00
Calle Wilund
198297a08a generic_server: Handle TLS error codes indicating broken pipe
Fixes  #14625

In broken pipe detection, handle also TLS error codes.

Requires https://github.com/scylladb/seastar/pull/1729

Closes #14626

(cherry picked from commit 890f1f4ad3)
2023-08-29 15:38:21 +03:00
Botond Dénes
9a9b5b691d Update seastar submodule
* seastar 534cb38c...85147cfd (1):
  > tls: Export error_category instance used by tls + some common error codes

Refs: #14625
2023-08-29 15:37:24 +03:00
Alejo Sanchez
610b682cf4 gms, service: replicate live endpoints on shard 0
Call replicate_live_endpoints on shard 0 to copy from 0 to the rest of
the shards. And get the list of live members from shard 0.

Move lock to the callers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13240

(cherry picked from commit da00052ad8)
2023-08-29 12:28:00 +02:00
Kamil Braun
05f4640360 Merge 'api: gossiper: get alive nodes after reaching current shard 0 version' from Alecco
Add an API call to wait for all shards to reach the current shard 0
gossiper version. Throws when timeout is reached.

Closes #12540

* github.com:scylladb/scylladb:
  api: gossiper: fix alive nodes
  gms, service: lock live endpoint copy
  gms, service: live endpoint copy method

(cherry picked from commit b919373cce)
2023-08-29 12:27:52 +02:00
Kefu Chai
8ed58c7dca sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 9c24be05c3)

Closes #15150
2023-08-25 10:13:19 +03:00
Petr Gusev
a83c0a8bbc test_secondary_index_collections: change insert/create index order
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.

The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.

The patch just moves the index creation statements
from the second part to the first.

Fixes: #14076

Closes #14090

(cherry picked from commit 0415ac3d5f)

Closes #15101
2023-08-24 14:09:08 +03:00
Botond Dénes
df71753498 Merge '[Backport 5.2] distributed_loader: process_sstable_dir: do not verify snapshots' from Benny Halevy
This mini-series backports the fix for #12010 along with low-risk patches it depends on.

Fixes: #12010

Closes #15137

* github.com:scylladb/scylladb:
  distributed_loader: process_sstable_dir: do not verify snapshots
  utils/directories: verify_owner_and_mode: add recursive flag
  utils: Restore indentation after previous patch
  utils: Coroutinize verify_owner_and_mode()
2023-08-23 15:50:29 +03:00
Benny Halevy
6588ecd66f distributed_loader: process_sstable_dir: do not verify snapshots
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.

Fixes #12010

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 845b6f901b)
2023-08-23 13:19:55 +03:00
Benny Halevy
03640cc15b utils/directories: verify_owner_and_mode: add recursive flag
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 60862c63dd)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
6d4d576460 utils: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2eb88945ea)
2023-08-23 13:19:36 +03:00
Pavel Emelyanov
96aca473b4 utils: Coroutinize verify_owner_and_mode()
There's a helper verification_error() that prints a warning and returns
excpetional future. The one is converted into void throwing one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ebb812df0)
2023-08-23 13:19:30 +03:00
Aleksandra Martyniuk
29e6dc8c1b compaction: do not swallow compaction_stopped_exception for reshape
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.

Rethrow an exception in perfrom task for reshape compaction.

Fixes: #15058.

(cherry picked from commit e0ce711e4f)

Closes #15122
2023-08-23 12:11:58 +03:00
Vlad Zolotarov
9a414d440d scylla_raid_setup: make --online-discard argument useful
This argument was dead since its introduction and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.

Fixes #14963

Closes #14964

(cherry picked from commit e13a2b687d)
2023-08-22 10:40:37 +03:00
Anna Mikhlin
e0ebc95025 release: prepare for 5.2.7 2023-08-21 14:44:56 +03:00
Botond Dénes
b7ab42b61c Merge 'Ignore no such column family in repair' from Aleksandra Martyniuk
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.

When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.

Fixes: #13045

Closes #13068

* github.com:scylladb/scylladb:
  repair: fix indentation
  repair: continue user requested repair if no_such_column_family is thrown
  repair: add find_column_family_if_exists function

(cherry picked from commit 9859bae54f)
2023-08-20 19:49:21 +03:00
Botond Dénes
098baaef48 Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL  `counterasblob()` and `blobascounter()` functions (issue #14742).

As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.

Fixes #14501
Fixes #14742

Closes #14745

* github.com:scylladb/scylladb:
  test/cql-pytest: test confirming that casting to counter doesn't work
  cql: support casting of counter to other types
  cql: implement missing counterasblob() and blobascounter() functions
  cql: implement missing type functions for "counters" type

(cherry picked from commit a637ddd09c)
2023-08-13 14:53:48 +03:00
Nadav Har'El
e11561ef65 cql-pytest: translate Cassandra's tests for compact tables
This is a translation of Cassandra's CQL unit test source file
validation/operations/CompactStorageTest.java into our cql-pytest
framework.

This very large test file includes 86 tests for various types of
operations and corner cases of WITH COMPACT STORAGE tables.

All 86 tests pass on Cassandra (except one using a deprecated feature
that needs to be specially enabled). 30 of the tests fail on Scylla
reproducing 7 already-known Scylla issues and 7 previously-unknown issues:

Already known issues:

Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
Refs #4244: Add support for mixing token, multi- and single-column
            restrictions
Refs #5361: LIMIT doesn't work when using GROUP BY
Refs #5362: LIMIT is not doing it right when using GROUP BY
Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #8627: Cleanly reject updates with indexed values where value > 64k

New issues:

Refs #12471: Range deletions on COMPACT STORAGE is not supported
Refs #12474: DELETE prints misleading error message suggesting
             ALLOW FILTERING would work
Refs #12477: Combination of COUNT with GROUP BY is different from
             Cassandra in case of no matches
Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column
Refs #12526: Support filtering on COMPACT tables
Refs #12749: Unsupported empty clustering key in COMPACT table
Refs #12815: Hidden column "value" in compact table isn't completely hidden

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12816

(cherry picked from commit 328cdb2124)
2023-08-13 14:44:19 +03:00
Nadav Har'El
e03c21a83b cql-pytest: translate Cassandra's tests for CAST operations
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.

There are 13 tests, 9 of them currently xfail.

The failures are caused by one recently-discovered issue:

Refs #14501: Cannot Cast Counter To Double

and by three previously unknown or undocumented issues:

Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
             milliseconds
Refs #14522: Support CAST function not only in SELECT

Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14528

(cherry picked from commit f08bc83cb2)
2023-08-13 14:41:36 +03:00
Nadav Har'El
79b5befe65 test/cql-pytest: add tests for data casts and inf in sums
This patch adds tests to reproduce issue #13551. The issue, discovered
by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast())
from varint type broke. So we add two tests in cql-pytest:

1. A new test file, test_cast_data.py, for testing data casts (a
   CAST (...) as ... in a SELECT), starting with testing casts from
   varint to other types.

   The test uncovers a lot of interesting cases (it is heavily
   commented to explain these cases) but nothing there is wrong
   and all tests pass on Scylla.

2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out
   that this caused #13551. In Cassandra and older Scylla, the sum
   returned a NaN. In Scylla today, it generates a misleading
   error message.

As usual, the tests were run on both Cassandra (4.1.1) and Scylla.

Refs #13551.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 78555ba7f1)
2023-08-13 14:40:08 +03:00
Petr Gusev
aca9e41a44 topology.cc: remove_endpoint: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.

This is a backport of bcb1d7c to branch-5.2.

Refs: #14184

Closes #14893
2023-08-11 14:29:37 +03:00
Pavel Emelyanov
ff22807ed2 Update seastar submodule
* seastar 29a0e645...534cb38c (1):
  > rpc: Abort connection if send_entry() fails

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-09 11:30:57 +03:00
Botond Dénes
bcb8f6a8dd Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one

(cherry picked from commit a8feb7428d)
2023-08-09 10:20:53 +03:00
Kefu Chai
9ce3695a0d compaction_manager: prevent gc-only sstables from being compacted
before this change, there are chances that the temporary sstables
created for collecting the GC-able data create by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.

in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have good chance to
overlap the latter. these GC-only SSTables are assigned with
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.

so, in this change,

* to prevent sstables in the transient stage from being picked
  up by regular compactions, a new interface class is introduced
  so that the sstable is always added to registration before
  it is added to sstable set, and removed from registration after
  it is removed from sstable set. the struct helps to consolidate
  the regitration related logic in a single place, and helps to
  make it more obvious that the timespan of an sstable in
  the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
  overlap with the output sstable run. the run_id for the
  gc sstable run is created only when the gc sstable writer
  is created. because the gc sstables is not always created
  for all compactions.

please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` passes a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.

Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14725

(cherry picked from commit fdf61d2f7c)

Closes #14827
2023-08-04 09:59:10 +03:00
Patryk Jędrzejczak
4cd5847761 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.

(cherry picked from commit 5b167a4ad7)
2023-08-02 18:05:39 +02:00
Botond Dénes
2b7f1cd906 Update tools/java submodule
* tools/java 83b2168b19...f8f556d802 (1):
  > Use EstimatedHistogram in metricPercentilesAsArray

Fixes: #10089
2023-07-31 12:13:01 +03:00
Nadav Har'El
e34c62c567 Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()

(cherry picked from commit 056d04954c)
2023-07-31 03:43:44 -04:00
Nadav Har'El
992c50173a Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1  AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.

This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```

Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved.
We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different.

We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.

A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash.

Fixes: https://github.com/scylladb/scylladb/issues/13129

Closes #14429

* github.com:scylladb/scylladb:
  modification_statement: reject conditional statements with empty clustering key
  statements/cas_request: fix crash on empty clustering range in LWT

(cherry picked from commit 49c8c06b1b)
2023-07-31 09:14:55 +03:00
Beni Peled
58acf071bf release: prepare for 5.2.6 2023-07-30 14:19:28 +03:00
Raphael S. Carvalho
d2369fc546 cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`

cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818

(cherry picked from commit 050ce9ef1d)
2023-07-28 13:56:28 +02:00
Kamil Braun
6273c4df35 test: use correct timestamp resolution in test_group0_history_clearing_old_entries
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.

Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.

Fix it by using microseconds everywhere.

Fixes #14653

Closes #14670

(cherry picked from commit 9d4b3c6036)
2023-07-27 15:46:37 +02:00
Raphael S. Carvalho
752984e774 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813

(cherry picked from commit 0ac43ea877)
2023-07-26 14:30:32 +03:00
Raphael S. Carvalho
986491447b table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1d8cb32a5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14764
2023-07-20 16:46:15 +03:00
Takuya ASADA
a05bb26cd6 scylla_fstrim_setup: start scylla-fstrim.timer on setup
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.

Also, unmask is unnecessary for enabling the timer.

Fixes #14249

Closes #14252

(cherry picked from commit c70a9cbffe)

Closes #14421
2023-07-18 16:05:09 +03:00
Michał Chojnowski
41aef6dc96 partition_snapshot_reader.hh: fix iterator invalidation in do_refresh_state
do_refresh_state() keeps iterators to rows_entry in a vector.
This vector might be resized during the procedure, triggering
memory reclaim and invalidating the iterators, which can cause
arbitrarily long loops and/or a segmentation fault during make_heap().
To fix this, do_refresh_state has to always be called from the allocating
section.

Additionally, it turns out that the first do_refresh_state is useless,
because reset_state() doesn't set _change_mark. This causes do_refresh_state
to be needlessly repeated during a next_row() or next_range_tombstone() which
happens immediately after it. Therefore this patch moves the _change_mark
assignment from maybe_refresh_state to do_refresh_state, so that the change mark
is properly set even after the first refresh.

Fixes #14696

Closes #14697
2023-07-17 14:20:37 +02:00
Botond Dénes
aa5e904c40 repair: Release permit earlier when the repair_reader is done
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0414)
2023-07-14 18:18:43 +03:00
Marcin Maliszkiewicz
eff2fe79b1 alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454

(cherry picked from commit 6424dd5ec4)
2023-07-13 23:27:46 +03:00
Nadav Har'El
ee8b26167b Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation

(cherry picked from commit d2e089777b)
2023-07-13 23:27:38 +03:00
Yaron Kaikov
02bc54d4b6 release: prepare for 5.2.5 2023-07-13 14:23:18 +03:00
Avi Kivity
c9a5c4c876 Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types

(cherry picked from commit a7c2c9f92b)
2023-07-12 15:31:48 +03:00
Tomasz Grabiec
1a6f4389ae Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14649
2023-07-12 10:09:56 +03:00
Calle Wilund
1088c3e24a storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14636
2023-07-12 09:24:56 +03:00
Botond Dénes
c9cb8dcfd0 Merge '[backport 5.2] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Backported from c25201c1a3. `view_build_test.cc` needed some tiny adjustments for the backport.

Closes #14619
Fixes #14503

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-11 15:04:23 +03:00
Takuya ASADA
91c1feec51 scylla_raid_setup: wipe filesystem signatures from specified disks
The discussion on the thread says, when we reformat a volume with another
filesystem, kernel and libblkid may skip to populate /dev/disk/by-* since it
detected two filesystem signatures, because mkfs.xxx did not cleared previous
filesystem signature.
To avoid this, we need to run wipefs before running mkfs.

Note that this runs wipefs twice, for target disks and also for RAID device.
wipefs for RAID device is needed since wipefs on disks doesn't clear filesystem signatures on /dev/mdX (we may see previous filesystem signature on /dev/mdX when we construct RAID volume multiple time on same disks).

Also dropped -f option from mkfs.xfs, it will check wipefs is working as we
expected.

Fixes #13737

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #13738

(cherry picked from commit fdceda20cc)
2023-07-11 15:00:03 +03:00
Piotr Dulikowski
57d0310dcc combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452

(cherry picked from commit ee9bfb583c)

Closes #14605
2023-07-11 11:09:25 +03:00
Michał Chojnowski
78f25f2d36 test: view_build_test: add range tombstones to test_view_update_generator_buffering
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
14fa3ee34e test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
A random mutation test for view_updating_consumer's buffering logic.
Reproduces #14503.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
75933b9906 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but we this patch makes it variable for
unit testing purposes.
2023-07-11 09:44:00 +02:00
Michał Chojnowski
fc7b02c8e4 view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-11 09:44:00 +02:00
Jan Ciolek
0f4f8638c5 forward_service: fix forgetting case-sensitivity in aggregates
There was a bug that caused aggregates to fail when
used on column-sensitive columns.

For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.

The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
2023-07-10 15:22:58 +03:00
Botond Dénes
0ba37fa431 Merge 'doc: fix rollback in the 4.3-to-2021.1, 5.0-to-2022.1, and 5.1-to-2022.2 upgrade guides' from Anna Stuchlik
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).

This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

- [x]  5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)

(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)

Closes #14444

* github.com:scylladb/scylladb:
  doc: fix rollback in 4.3-to-2021.1 upgrade guide
  doc: fix rollback in 5.0-to-2022.1 upgrade guide
  doc: fix rollback in 5.1-to-2022.2 upgrade guide

(cherry picked from commit 8a7261fd70)
2023-07-10 15:16:24 +03:00
Raphael S. Carvalho
55edbded47 compaction: avoid excessive reallocation and during input list formatting
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.

in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13907

(cherry picked from commit 5544d12f18)

Fixes scylladb/scylladb#14071
2023-07-09 23:54:18 +03:00
Marcin Maliszkiewicz
9f79c9f41d docs: link general repairs page to RBNO page
Information was duplicated before and the version on this page was outdated - RBNO is enabled for replace operation already.

Closes #12984

(cherry picked from commit bd7caefccf)
2023-07-07 16:38:32 +02:00
Kamil Braun
6dd09bb4ea storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300

(cherry picked from commit 732feca115)

Backport note: the original `assert` was never added to branch-5.2, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
2023-07-05 13:14:24 +02:00
Mikołaj Grzebieluch
f431345ab6 raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.

This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.

Fixes #13543

(cherry picked from commit a45e0765e4)
2023-07-05 13:01:57 +02:00
Anna Stuchlik
009601d374 doc: fix rollback in 5.2-to-2023.1 upgrade guide
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.

This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

Closes #14373

(cherry picked from commit f4ae2c095b)
2023-06-29 12:07:41 +03:00
Botond Dénes
8e63b2f3e3 Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.

The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.

There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to detect this case.

After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.

Fixes #13491

Closes #14375

* github.com:scylladb/scylladb:
  readers: evictable_reader: don't accidentally consume the entire partition
  test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position

(cherry picked from commit 586102b42e)
2023-06-29 12:04:35 +03:00
Benny Halevy
483c0b183a repair: use fmt::join to print ks_erms|boost::adaptors::map_keys
This is a minimal fix for #13146 for branch-5.2

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14405
2023-06-27 14:15:28 +03:00
Anna Stuchlik
d5063e6347 doc: add Ubuntu 22 to 2021.1 OS support
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036

This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.

This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.

Closes #14372

(cherry picked from commit 74fc69c825)
2023-06-26 13:43:58 +03:00
Anna Stuchlik
543fa04e4d doc: udpate the OSS docs landing page
Fixes https://github.com/scylladb/scylladb/issues/14333

This commit replaces the documentation landing page with
the Open Source-only documentation landing page.

This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.

(cherry picked from commit f60f89df17)

Closes #14370
2023-06-23 14:00:33 +02:00
Anna Mikhlin
cebbf6c5df release: prepare for 5.2.4 2023-06-22 16:23:46 +03:00
Avi Kivity
73b8669953 Update seastar submodule (default priority class shares)
* seastar 32ab15cda6...29a0e64513 (1):
  > reactor: change shares for default IO class from 1 to 200

Fixes #13753.

In 5.3: 37e6e65211
2023-06-21 21:23:14 +03:00
Botond Dénes
9efca96cf2 Merge 'Backport 5.2 test.py stability/UX improvemenets' from Kamil Braun
Backport the following improvements for test.py topology tests for CI stability:
- https://github.com/scylladb/scylladb/pull/12652
- https://github.com/scylladb/scylladb/pull/12630
- https://github.com/scylladb/scylladb/pull/12619
- https://github.com/scylladb/scylladb/pull/12686
- picked from https://github.com/scylladb/scylladb/pull/12726: 9ceb6aba81
- picked from https://github.com/scylladb/scylladb/pull/12173: fc60484422
- https://github.com/scylladb/scylladb/pull/12765
- https://github.com/scylladb/scylladb/pull/12804
- https://github.com/scylladb/scylladb/pull/13342
- https://github.com/scylladb/scylladb/pull/13589
- picked from https://github.com/scylladb/scylladb/pull/13135: 7309a1bd6b
- picked from https://github.com/scylladb/scylladb/pull/13134: 21b505e67c, a4411e9ec4, c1d0ee2bce, 8e3392c64f, 794d0e4000, e407956e9f
- https://github.com/scylladb/scylladb/pull/13271
- https://github.com/scylladb/scylladb/pull/13399
- picked from https://github.com/scylladb/scylladb/pull/12699: 3508a4e41e, 08d754e13f, 62a945ccd5, 041ee3ffdd
- https://github.com/scylladb/scylladb/pull/13438 (but skipped the test_mutation_schema_change.py fix since I didn't backport this new test)
- https://github.com/scylladb/scylladb/pull/13427
- https://github.com/scylladb/scylladb/pull/13756
- https://github.com/scylladb/scylladb/pull/13789
- https://github.com/scylladb/scylladb/pull/13933 (but skipped the test_snapshot.py fix since I didn't backport this new test)

Closes #14215

* github.com:scylladb/scylladb:
  test: pylib: fix `read_barrier` implementation
  test: pylib: random_tables: perform read barrier in `verify_schema`
  test: issue a read barrier before checking ring consistency
  Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
  test/pylib: ManagerClient helpers to wait for...
  test: pylib: Add a way to create cql connections with particular coordinators
  test/pylib: get gossiper alive endpoints
  test/topology: default replication factor 3
  test/pylib: configurable replication factor
  scylla_cluster.py: optimize node logs reading
  test/pylib: RandomTables.add_column with value column
  scylla_cluster.py: add start flag to server_add
  ServerInfo: drop host_id
  scylla_cluster.py: add config to server_add
  scylla_cluster.py: add expected_error to server_start
  scylla_cluster.py: ScyllaServer.start, refactor error reporting
  scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
  test: improve logging in ScyllaCluster
  test: topology smp test with custom cluster
  test/pylib: topology: support clusters of initial size 0
  Merge 'test/pylib: split and refactor topology tests' from Alecco
  Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
  test: Increase START_TIMEOUT
  test/pylib: one-shot error injection helper
  test: topology: wait for token ring/group 0 consistency after decommission
  test: topology: verify that group 0 and token ring are consistent
  Merge 'pytest: start after ungraceful stop' from Alecco
  Merge 'test.py: improve test failure handling' from Kamil Braun
2023-06-15 07:19:39 +03:00
Pavel Emelyanov
210e3d1999 Backport 'Merge 'Enlighten messaging_service::shutdown()''
This includes seastar update titled
  'Merge 'Split rpc::server stop into two parts''

* br-5.2-backport-ms-shutdown:
  messaging_service: Shutdown rpc server on shutdown
  messaging_service: Generalize stop_servers()
  messaging_service: Restore indentation after previous patch
  messaging_service: Coroutinize stop()
  messaging_service: Coroutinize stop_servers()
  Update seastar submodule

refs: #14031
2023-06-14 09:14:06 +03:00
Pavel Emelyanov
702d622b38 messaging_service: Shutdown rpc server on shutdown
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:04:04 +03:00
Pavel Emelyanov
db44630254 messaging_service: Generalize stop_servers()
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:59 +03:00
Pavel Emelyanov
5d3d64bafe messaging_service: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:53 +03:00
Pavel Emelyanov
079f5d8eca messaging_service: Coroutinize stop()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:48 +03:00
Pavel Emelyanov
fd7310b104 messaging_service: Coroutinize stop_servers()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:03:42 +03:00
Pavel Emelyanov
991d00964d Update seastar submodule
* seastar 8c86e6de...32ab15cd (1):
  > rpc: Introduce server::shutdown()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-06-14 09:02:46 +03:00
Anna Stuchlik
0137ddaec8 doc: remove support for Ubuntu 18
Fixes https://github.com/scylladb/scylladb/issues/14097

This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.

The update is in sync with the change made for
ScyllaDB 5.2.

This commit must be backported to branch-5.2 and
branch-5.3.

Closes #14118

(cherry picked from commit b7022cd74e)
2023-06-13 12:06:56 +03:00
Raphael S. Carvalho
58f88897c8 compaction: Fix incremental compaction for sstable cleanup
After c7826aa910, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038

(cherry picked from commit 23443e0574)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14193
2023-06-13 09:53:46 +03:00
Kamil Braun
f4115528d6 test: pylib: fix read_barrier implementation
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.

Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.

This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).

Closes #13933

(cherry picked from commit 64dc76db55)
Backport note: skipped the test_snapshot.py change, as the test doesn't
exist on this branch.
2023-06-12 12:40:22 +02:00
Kamil Braun
9c941aba0b test: pylib: random_tables: perform read barrier in verify_schema
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.

However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.

To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.

Fixes #13788

Closes #13789

(cherry picked from commit 3f3dcf451b)
2023-06-12 12:40:22 +02:00
Konstantin Osipov
094bcac399 test: issue a read barrier before checking ring consistency
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times, it only guarantees the
same order of events on all replicas.

When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.

To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.

Fixes #13518 (flaky topology test).

Closes #13756

(cherry picked from commit e7c9ca560b)
2023-06-12 12:40:22 +02:00
Kamil Braun
e49a531aaa Merge 'scylla_cluster.py: fix read_last_line' from Gusev Petr
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.

Closes #13427

* github.com:scylladb/scylladb:
  pylib_test: add tests for read_last_line
  pytest: add pylib_test directory
  scylla_cluster.py: fix read_last_line
  scylla_cluster.py: move read_last_line to util.py

(cherry picked from commit 70f2b09397)
2023-06-12 12:40:22 +02:00
Alejo Sanchez
bcf99a37cd test/pylib: ManagerClient helpers to wait for...
server to see other servers after start/restart

When starting/restarting a server, provide a way to wait for the server
to see at least n other servers.

Also leave the implementation methods available for manual use and
update previous tests, one to wait for a specific server to be seen, and
one to wait for a specific server to not be seen (down).

Fixes #13147

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13438

(cherry picked from commit 11561a73cb)
Backport note: skipped the test_mutation_schema_change.py fix as the
test doesn't exist on this branch.
2023-06-12 12:40:08 +02:00
Tomasz Grabiec
fe4af95745 test: pylib: Add a way to create cql connections with particular coordinators
Usage:

  await manager.driver_connect(server=servers[0])
  manager.cql.execute(f"...", execution_profile='whitelist')

(cherry picked from commit 041ee3ffdd)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
ac5dff7de0 test/pylib: get gossiper alive endpoints
Helper to get list of gossiper alive endpoints from REST API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 62a945ccd5)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
ad99456a9d test/topology: default replication factor 3
For most tests there will be nodes down, increase replication factor to
3 to avoid having problems for partitions belonging to down nodes.

Use replication factor 1 for raft upgrade tests.

(cherry picked from commit 08d754e13f)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
937e890fba test/pylib: configurable replication factor
Make replication factor configurable for the RandomTables helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 3508a4e41e)
2023-06-12 12:38:15 +02:00
Petr Gusev
12eec5bb2b scylla_cluster.py: optimize node logs reading
There are two occasions in scylla_cluster
where we read the node logs, and in both of
them we read the entire file in memory.
This is not efficient and may cause an OOM.

In the first case we need the last line of the
log file, so we seek at the end and move backwards
looking for a new line symbol.

In the second case we look through the
log file to find the expected_error.
The readlines() method returns a Python
list object, which means it reads the entire
file in memory. It's sufficient to just remove
it since iterating over the file instance
already yields lines lazily one by one.

This is a follow-up for #13134.

Closes #13399

(cherry picked from commit 09636b20f3)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
59847389d4 test/pylib: RandomTables.add_column with value column
When adding extra columns in a test, make them value column. Name them
with the "v_" prefix and use the value column number counter.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13271

(cherry picked from commit 81b40c10de)
2023-06-12 12:38:15 +02:00
Petr Gusev
7a8c5db55b scylla_cluster.py: add start flag to server_add
Sometimes when creating a node it's useful
to just install it and not start. For example,
we may want to try to start it later with
expected error.

The ScyllaServer.install method has been made
exception safe, if an exception occurs, it
reverts to the original state. This allows
to not duplicate the try/except logic
in two of its call sites.

(cherry picked from commit e407956e9f)
2023-06-12 12:38:15 +02:00
Petr Gusev
15ea5bf53f ServerInfo: drop host_id
We are going to allow the
ScyllaCluster.add_server function not to
start the server if the caller has requested
that with a special parameter. The host_id
can only be obtained from a running node, so
add_server won't be able to return it in
this case. I've grepped the tests for host_id
and there doesn't seem to be any
reference to it in the code.

(cherry picked from commit 794d0e4000)
2023-06-12 12:38:15 +02:00
Petr Gusev
3ab610753e scylla_cluster.py: add config to server_add
Sometimes when creating a node it's useful
to pass a custom node config.

(cherry picked from commit 8e3392c64f)
2023-06-12 12:38:15 +02:00
Petr Gusev
1959eddf86 scylla_cluster.py: add expected_error to server_start
Sometimes it's useful to check that the node has failed
to start for a particular reason. If server_start can't
find expected_error in the node's log or if the
node has started without errors, it throws an exception.

(cherry picked from commit c1d0ee2bce)
2023-06-12 12:38:15 +02:00
Petr Gusev
43525aec83 scylla_cluster.py: ScyllaServer.start, refactor error reporting
Extract the function that encapsulates all the error
reporting logic. We are going to use it in several
other places to implement expected_error feature.

(cherry picked from commit a4411e9ec4)
2023-06-12 12:38:15 +02:00
Petr Gusev
930c4e65d6 scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
The ScyllaServer expects cmd to be None if the
Scylla process is not running. Otherwise, if start failed
and the test called update_config, the latter will
try to send a signal to a non-existent process via cmd.

(cherry picked from commit 21b505e67c)
2023-06-12 12:38:15 +02:00
Konstantin Osipov
d2caaef188 test: improve logging in ScyllaCluster
Print IP addresses and cluster identifiers in more log messages,
it helps debugging.

(cherry picked from commit 7309a1bd6b)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
6474edd691 test: topology smp test with custom cluster
Instead of decommission of initial cluster, use custom cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13589

(cherry picked from commit ce87aedd30)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
b39cdadff3 test/pylib: topology: support clusters of initial size 0
To allow tests with custom clusters, allow configuration of initial
cluster size of 0.

Add a proof-of-concept test to be removed later.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #13342

(cherry picked from commit e3b462507d)
2023-06-12 12:38:15 +02:00
Nadav Har'El
7b60cddae7 Merge 'test/pylib: split and refactor topology tests' from Alecco
Move long running topology tests out of  `test_topology.py` and into their own files, so they can be run in parallel.

While there, merge simple schema tests.

Closes #12804

* github.com:scylladb/scylladb:
  test/topology: rename topology test file
  test/topology: lint and type for topology tests
  test/topology: move topology ip tests to own file
  test/topology: move topology test remove garbaje...
  test/topology: move topology rejoin test to own file
  test/topology: merge topology schema tests and...
  test/topology: isolate topology smp params test
  test/topology: move topology helpers to common file

(cherry picked from commit a24600a662)
2023-06-12 12:38:15 +02:00
Botond Dénes
ea80fe20ad Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.

Closes #12765

* github.com:scylladb/scylladb:
  test/pylib: use larger timeout for decommission/removenode
  test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT

(cherry picked from commit e55f475db1)
2023-06-12 12:38:15 +02:00
Asias He
f90fe6f312 test: Increase START_TIMEOUT
It is observed that CI machine is slow to run the test. Increase the
timeout of adding servers.

(cherry picked from commit fc60484422)
2023-06-12 12:38:15 +02:00
Alejo Sanchez
6e2c547388 test/pylib: one-shot error injection helper
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.

Fix tests using the previous helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
(cherry picked from commit 9ceb6aba81)
2023-06-12 12:38:05 +02:00
Kamil Braun
91aa2cd8d7 test: topology: wait for token ring/group 0 consistency after decommission
There was a check for immediate consistency after a decommission
operation has finished in one of the tests, but it turns out that also
after decommission it might take some time for token ring to be updated
on other nodes. Replace the check with a wait.

Also do the wait in another test that performs a sequence of
decommissions. We won't attempt to start another decommission until
every node learns that the previously decommissioned node has left.

Closes #12686

(cherry picked from commit 40142a51d0)
2023-06-12 11:58:32 +02:00
Kamil Braun
05c3f7ecef test: topology: verify that group 0 and token ring are consistent
After topology changes like removing a node, verify that the set of
group 0 members and token ring members is the same.

Modify `get_token_ring_host_ids` to only return NORMAL members. The
previous version which used the `/storage_service/host_id` endpoint
might have returned non-NORMAL members as well.

Fixes: #12153

Closes #12619

(cherry picked from commit fa9cf81af2)
2023-06-12 11:58:02 +02:00
Kamil Braun
3aa73e8b5a Merge 'pytest: start after ungraceful stop' from Alecco
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.

Fixes #12218

Closes #12630

* github.com:scylladb/scylladb:
  pytest: test start after ungraceful stop
  test.py: enable force_schema_commit_log

(cherry picked from commit 5eadea301e)
2023-06-12 11:57:09 +02:00
Nadav Har'El
a0ba3b3350 Merge 'test.py: improve test failure handling' from Kamil Braun
Improve logging by printing the cluster at the end of each test.

Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.

Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.

Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.

Closes #12652

* github.com:scylladb/scylladb:
  test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
  test/topology: don't drop random_tables keyspace after a failed test
  test/pylib: mark cluster as dirty after a failed test
  test: pylib, topology: don't perform operations after test on a dirty cluster
  test/pylib: print cluster at the end of test

(cherry picked from commit 2653865b34)
2023-06-12 11:47:54 +02:00
Anna Mikhlin
ea08d409f1 release: prepare for 5.2.3 2023-06-08 22:04:50 +03:00
Avi Kivity
f32971b81f Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.

Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.

Fixes: #13784

Closes #13790

* github.com:scylladb/scylladb:
  test/boost/database_test: add unit test for semaphore mismatch on range scans
  partition_slice_builder: add set_specific_ranges()
  multishard_mutation_query: make reader_context::lookup_readers() exception safe
  multishard_mutation_query: lookup_readers(): make inner lambda a coroutine

(cherry picked from commit 1c0e8c25ca)
2023-06-08 04:29:51 -04:00
Michał Chojnowski
8872157422 data_dictionary: fix forgetting of UDTs on ALTER KEYSPACE
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.

Fixes #14139

Closes #14143

(cherry picked from commit 1a521172ec)
2023-06-06 21:52:47 +03:00
Kamil Braun
b5785ed434 auth: don't use infinite timeout in default_role_row_satisfies query
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.

Use the same timeout configuration as other queries in this module do.

Fixes #13545.

Closes #14134

(cherry picked from commit f51312e580)
2023-06-06 19:39:29 +03:00
Pavel Emelyanov
70f93767fd Update seastar submodule
* seastar 98504c4b...8c86e6de (1):
  > rpc: Wait for server socket to stop before killing conns

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-30 20:10:44 +03:00
Tzach Livyatan
bb3751334c Remove Ubuntu 18.04 support from 5.2
Ubuntu [18.04 will be soon out of standard support](https://ubuntu.com/blog/18-04-end-of-standard-support), and can be removed from 5.2 supported list
https://github.com/scylladb/scylla-pkg/issues/3346

Closes #13529

(cherry picked from commit e655060429)
2023-05-30 16:25:42 +03:00
Beni Peled
9dd70a58c3 release: prepare for 5.2.2 2023-05-18 14:03:20 +03:00
Anna Stuchlik
0bc6694ac5 doc: fix the links to the Enterprise docs
Fixes https://github.com/scylladb/scylladb/issues/13915

This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.

This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.

Closes #13917

(cherry picked from commit 6f4a68175b)
2023-05-18 08:40:02 +03:00
Botond Dénes
486483b379 Merge '[Backport 5.2]: node ops backports' from Benny Halevy
This branch backports to branch-5.2 several fixes related to node operations:
- ba919aa88a (PR #12980; Fixes: #11011, #12969)
- 53636167ca (part of PR #12970; Fixes: #12764, #12956)
- 5856e69462 (part of PR #12970)
- 2b44631ded (PR #13028; Fixes: #12989)
- 6373452b31 (PR #12799; Fixes #12798)

Closes #13531

* github.com:scylladb/scylladb:
  Merge 'Do not mask node operation errors' from Benny Halevy
  Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
  storage_service: Wait for normal state handler to finish in replace
  storage_service: Wait for normal state handler to finish in bootstrap
  storage_service: Send heartbeat earlier for node ops
2023-05-17 16:46:49 +03:00
Tzach Livyatan
9afaec5b12 Update Azure recommended instances type from the Lsv2-series to the Lsv3-series
Closes #13835

(cherry picked from commit a73fde6888)
2023-05-17 15:41:47 +03:00
Anna Stuchlik
9c99dc36b5 doc: add OS support for version 2023.1
Fixes https://github.com/scylladb/scylladb/issues/13857

This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.

After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1

Closes: #13858
(cherry picked from commit 84ed95f86f)
2023-05-16 10:11:21 +03:00
Tomasz Grabiec
548a7f73d3 Merge 'range_tombstone_change_generator: fix an edge case in flush()' from Michał Chojnowski
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes https://github.com/scylladb/scylladb/issues/12462

Closes #13894

* github.com:scylladb/scylladb:
  tests: row_cache: Add reproducer for reader producing missing closing range tombstone
  range_tombstone_change_generator: fix an edge case in flush()
2023-05-15 23:29:08 +02:00
Raphael S. Carvalho
5c66875dbe sstables: Fix use-after-move when making reader in reverse mode
static report:
sstables/mx/reader.cc:1705:58: error: invalid invocation of method 'operator*' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
            legacy_reverse_slice_to_native_reverse_slice(*schema, slice.get()), pc, std::move(trace_state), fwd, fwd_mr, monitor);

Fixes #13394.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 213eaab246)
2023-05-15 20:27:34 +03:00
Raphael S. Carvalho
26b4d2c3c1 db/view/build_progress_virtual_reader: Fix use-after-move
use-after-free in ctor, which potentially leads to a failure
when locating table from moved schema object.

static report
In file included from db/system_keyspace.cc:51:
./db/view/build_progress_virtual_reader.hh:202:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
                _db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),

Fixes #13395.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1ecba373d6)
2023-05-15 20:26:01 +03:00
Raphael S. Carvalho
874062b72a index/built_indexes_virtual_reader.hh: Fix use-after-move
static report:
./index/built_indexes_virtual_reader.hh:228:40: warning: invalid invocation of method 'operator->' on object 's' while it is in the 'consumed' state [-Wconsumed]
                _db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),

Fixes #13396.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit f8df3c72d4)
2023-05-15 20:24:35 +03:00
Raphael S. Carvalho
71ec750a59 replica: Fix use-after-move in table::make_streaming_reader
Variant used by
streaming/stream_transfer_task.cc:        , reader(cf.make_streaming_reader(cf.schema(), std::move(permit_), prs))

as full slice is retrieved after schema is moved (clang evaluates
left-to-right), the stream transfer task can be potentially working
on a stale slice for a particular set of partitions.

static report:
In file included from replica/dirty_memory_manager.cc:6:
replica/database.hh:706:83: error: invalid invocation of method 'operator->' on object 'schema' while it is in the 'consumed' state [-Werror,-Wconsumed]
        return make_streaming_reader(std::move(schema), std::move(permit), range, schema->full_slice());

Fixes #13397.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 04932a66d3)
2023-05-15 20:21:48 +03:00
Tomasz Grabiec
7c1bdc6553 tests: row_cache: Add reproducer for reader producing missing closing range tombstone
Adds a reproducer for #12462.

The bug manifests by reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the fix range_tombstone_change_generator::flush()
was used with end_of_range=true to produce the closing range_tombstone_change
and it did not handle correctly the case when there are two adjacent range
tombstones and flush(pos, end_of_range=true) is called such that pos is the
boundary between the two.

Cherry-picked from a717c803c7.
2023-05-15 18:02:40 +02:00
Michał Chojnowski
24d966f806 range_tombstone_change_generator: fix an edge case in flush()
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes #12462
2023-05-15 17:48:24 +02:00
Asias He
05a3a1bf55 tombstone_gc: Fix gc_before for immediate mode
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.

Setting gc_before to gc_clock::time_point::max, a row could be dropped
by compaction even if the ttl is not expired yet.

The following procedure reproduces the issue:

- Start 2 nodes

- Insert data

```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```

- Run nodetool flush and nodetool compact

- Compaction drops all data

```
~128 total partitions merged to 0.
```

Fixes #13572

Closes #13800

(cherry picked from commit 7fcc403122)
2023-05-15 10:33:29 +03:00
Takuya ASADA
f148a6be1d scylla_kernel_check: suppress verbose iotune messages
Stop printing verbose iotune messages while the check, just print error
message.

Fixes #13373.

Closes #13362

(cherry picked from commit 160c184d0b)
2023-05-14 21:25:57 +03:00
Benny Halevy
5785550e24 view: view_builder: start: demote sleep_aborted log error
This is not really an error, so print it in debug log_level
rather than error log_level.

Fixes #13374

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13462

(cherry picked from commit cc42f00232)
2023-05-14 21:21:59 +03:00
Avi Kivity
401de17c82 Update seastar submodule (condition_variable tasktrace fix)
* seastar aa46b980ec...98504c4bb6 (1):
  > condition-variable: replace the coroutine wakeup task with a promise

Fixes #13368
2023-05-14 21:12:12 +03:00
Raphael S. Carvalho
94c9553e8a Fix use-after-move when initializing row cache with dummy entry
Courtersy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});

The use-after-move is UB, as for it to happen, depends on evaluation order.

We haven't hit it yet as clang is left-to-right.

Fixes #13400.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13401

(cherry picked from commit d2d151ae5b)
2023-05-14 21:02:24 +03:00
Anna Mikhlin
f1c45553bc release: prepare for 5.2.1 2023-05-08 22:15:46 +03:00
Botond Dénes
1a288e0a78 Update seastar submodule
* seastar 1488aaf8...aa46b980 (1):
  > core/on_internal_error: always log error with backtrace

Fixes: #13786
2023-05-08 10:30:10 +03:00
Marcin Maliszkiewicz
a2fed1588e db: view: use deferred_close for closing staging_sstable_reader
When consume_in_thread throws the reader should still be closed.

Related https://github.com/scylladb/scylla-enterprise/issues/2661

Closes #13398
Refs: scylladb/scylla-enterprise#2661
Fixes: #13413

(cherry picked from commit 99f8d7dcbe)
2023-05-08 09:41:07 +03:00
Botond Dénes
f07a06d390 Merge 'service:forward_service: use long type instead of counter in function mocking' from Michał Jadwiszczak
Aggregation query on counter column is failing because forward_service is looking for function with counter as an argument and such function doesn't exist. Instead the long type should be used.

Fixes: #12939

Closes #12963

* github.com:scylladb/scylladb:
  test:boost: counter column parallelized aggregation test
  service:forward_service: use long type when column is counter

(cherry picked from commit 61e67b865a)
2023-05-07 14:27:29 +03:00
Anna Stuchlik
4ec531d807 doc: remove the sequential repair option from docs
Fixes https://github.com/scylladb/scylladb/issues/12132

The sequential repair mode is not supported. This commit
removes the incorrect information from the documentation.

Closes #13544

(cherry picked from commit 3d25edf539)
2023-05-07 14:27:29 +03:00
Asias He
4867683f80 storage_service: Fix removing replace node as pending
Consider

- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2

  ```
  auto host_id = _gossiper.get_host_id(endpoint);
  auto existing = tmptr->get_endpoint_for_host_id(host_id);
  ```

  host_id = new host id
  existing = empty

  As a result, del_replacing_endpoint() will not be called.

This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.

This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
replacing node uses a new host id than the node to be replaced.

To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.

Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3

After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3

Tests: https://github.com/scylladb/scylla-dtest/pull/3126

Closes #13677

Fixes: https://github.com/scylladb/scylla-enterprise/issues/2852

(cherry picked from commit a8040306bb)
2023-05-03 14:15:13 +03:00
Botond Dénes
0e42defe06 readers: evictable_reader: skip progress guarantee when next pos is partition start
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guranteeing this forward change has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is garanteed anyway, because there is exactly one
partition-start fragment in each partition.

Fixes: #13491

Closes #13563

(cherry picked from commit 72003dc35c)
2023-05-02 21:58:41 +03:00
Avi Kivity
f73d017f05 tools: toolchain: regenerate
Fixes #13744
2023-05-02 13:16:59 +03:00
Pavel Emelyanov
3723678b82 scylla-gdb: Parse and eval _all_threads without quotes
I've no idea why the quotes are there at all, it works even without
them. However, with quotes gdb-13 fails to find the _all_threads static
thread-local variable _unless_ it's printed with gdb "p" command
beforehand.

fixes: #13125

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13132

(cherry picked from commit 537510f7d2)
2023-05-02 13:16:59 +03:00
Botond Dénes
ea506f50cc Merge 'Do not mask node operation errors' from Benny Halevy
This series handles errors when aborting node operations and prints them rather letting them leak and be exposed to the user.

Also, cleanup the node_ops logging formats when aborting different node ops
and add more error logging around errors in the "worker" nodes.

Closes #12799

* github.com:scylladb/scylladb:
  storage_service: node_ops_signal_abort: print a warning when signaling abort
  storage_service: s/node_ops_singal_abort/node_ops_signal_abort/
  storage_service: node_ops_abort: add log messages
  storage_service: wire node_ops_ctl for node operations
  storage_service: add node_ops_ctl class to formalize all node_ops flow
  repair: node_ops_cmd_request: add print function
  repair: do_decommission_removenode_with_repair: log ignore_nodes
  repair: replace_with_repair: get ignore_nodes as unordered_set
  gossiper: get_generation_for_nodes: get nodes as unordered_set
  storage_service: don't let node_ops abort failures mask the real error

(cherry picked from commit 6373452b31)
2023-04-30 18:58:28 +03:00
Kamil Braun
42fd3704e4 Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.

If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.

It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.

It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.

There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.

We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.

The race can still happen, but the time window is now much smaller.

The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.

Fixes #12989
Refs #12969

Closes #13028

* github.com:scylladb/scylladb:
  storage_service: node ops: Extract node_ops_insert() to reduce code duplication
  storage_service: Make node operations safer by detecting asymmetric abort
  storage_service: node ops: Add error injections
  service: node_ops: Make watchdog and heartbeat intervals configurable

(cherry picked from commit 2b44631ded)
2023-04-30 18:58:28 +03:00
Asias He
c9d19b3595 storage_service: Wait for normal state handler to finish in replace
Similar to "storage_service: Wait for normal state handler to finish in
bootstrap", this patch enables the check on the replace procedure.

(cherry picked from commit 5856e69462)
2023-04-30 18:58:28 +03:00
Asias He
9a873bf4b3 storage_service: Wait for normal state handler to finish in bootstrap
In storage_service::handle_state_normal, storage_service::notify_joined
will be called which drops the rpc connections to the node becomes
normal. This causes rpc calls with that node fail with
seastar::rpc::closed_error error.

Consider this:

- n1 in the cluster
- n2 is added to join the cluster
- n2 sees n1 is in normal status
- n2 starts bootstrap process
- notify_joined on n2 closes rpc connection to n1 in the middle of
  bootstrap
- n2 fails to bootstrap

For example, during bootstrap with RBNO, we saw repair failed in a
test that sets ring_delay to zero and does not wait for gossip to
settle.

repair - repair[9cd0dbf8-4bca-48fc-9b1c-d9e80d0313a2]: sync data for
keyspace=system_distributed_everywhere, status=failed:
std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is
closed)})

This patch fixes the race by waiting for the handle_state_normal handler
to finish before the bootstrap process.

Fixes #12764
Fixes #12956

(cherry picked from commit 53636167ca)
2023-04-30 18:58:28 +03:00
Asias He
51a00280a2 storage_service: Send heartbeat earlier for node ops
Node ops has the following procedure:

1   for node in sync_nodes
      send prepare cmd to node

2   for node in sync_nodes
      send heartbeat cmd to node

If any of the prepare cmd in step 1 takes longer than the heartbeat
watchdog timeout, the heartbeat in step 2 will be too late to update the
watchdog, as a result the watchdog will abort the operation.

To prevent slow prepare cmd kills the node operations, we can start the
heartbeat earlier in the procedure.

Fixes #11011
Fixes #12969

Closes #12980

(cherry picked from commit ba919aa88a)
2023-04-30 18:58:28 +03:00
Wojciech Mitros
b0a7c02e09 rust: update dependencies
Cranelift-codegen 0.92.0 and wasmtime 5.0.0 have security issues
potentially allowing malicious UDFs to read some memory outside
the wasm sandbox. This patch updates them to versions 0.92.1
and 5.0.1 respectively, where the issues are fixed.

Fixes #13157

Closes #13171

(cherry picked from commit aad2afd417)
2023-04-27 22:01:44 +03:00
Wojciech Mitros
f18c49dcc6 rust: update dependencies
Wasmtime added some improvements in recent releases - particularly,
two security issues were patched in version 2.0.2. There were no
breaking changes for our use other than the strategy of returning
Traps - all of them are now anyhow::Errors instead, but we can
still downcast to them, and read the corresponding error message.

The cxx, anyhow and futures dependency versions now match the
versions saved in the Cargo.lock.

Closes #12830

(cherry picked from commit 8b756cb73f)

Ref #13157
2023-04-27 22:00:54 +03:00
Anna Stuchlik
35dfec78d1 doc: fixes https://github.com/scylladb/scylladb/issues/12964, removes the information that the CDC options are experimental
Closes #12973

(cherry picked from commit 4dd1659d0b)
2023-04-27 21:06:49 +03:00
Raphael S. Carvalho
dbd8ca4ade replica: Fix undefined behavior in table::generate_and_propagate_view_updates()
Undefined behavior because the evaluation order is undefined.

With GCC, where evaluation is right-to-left, schema will be moved
once it's forwarded to make_flat_mutation_reader_from_mutations_v2().

The consequence is that memory tracking of mutation_fragment_v2
(for tracking only permit used by view update), which uses the schema,
can be incorrect. However, it's more likely that Scylla will crash
when estimating memory usage for row, which access schema column
information using schema::column_at(), which in turn asserts that
the requested column does really exist.

Fixes #13093.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13092

(cherry picked from commit 3fae46203d)
2023-04-27 19:56:38 +03:00
Anna Stuchlik
1be4afb842 doc: remove incorrect info about BYPASS CACHE
Fixes https://github.com/scylladb/scylladb/issues/13106

This commit removes the information that BYPASS CACHE
is an Enterprise-only feature and replaces that info
with the link to the BYPASS CACHE description.

Closes #13316

(cherry picked from commit 1cfea1f13c)
2023-04-27 19:54:04 +03:00
Kefu Chai
7cc9f5a05f dist/redhat: enforce dependency on %{release} also
* tools/python3 279b6c1...cf7030a (1):
  > dist: redhat: provide only a single version

s/%{version}/%{version}-%{release}/ in `Requires:` sections.

this enforces the runtime dependencies of exactly the same
releases between scylla packages.

Fixes #13222
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 7165551fd7)
2023-04-27 19:27:34 +03:00
Nadav Har'El
bf7fc9709d test/rest_api: fix flaky test for toppartitions
The REST test test_storage_service.py::test_toppartitions_pk_needs_escaping
was flaky. It tests the toppartition request, which unfortunately needs
to choose a sampling duration in advance, and we chose 1 second which we
considered more than enough - and indeed typically even 1ms is enough!
but very rarely (only know of only one occurance, in issue #13223) one
second is not enough.

Instead of increasing this 1 second and making this test even slower,
this patch takes a retry approach: The tests starts with a 0.01 second
duration, and is then retried with increasing durations until it succeeds
or a 5-seconds duration is reached. This retry approach has two benefits:
1. It de-flakes the test (allowing a very slow test to take 5 seconds
instead of 1 seconds which wasn't enough), and 2. At the same time it
makes a successful test much faster (it used to always take a full
second, now it takes 0.07 seconds on a dev build on my laptop).

A *failed* test may, in some cases, take 10 seconds after this patch
(although in some other cases, an error will be caught immediately),
but I consider this acceptable - this test should pass, after all,
and a failure indicates a regression and taking 10 seconds will be
the last of our worries in that case.

Fixes #13223.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13238

(cherry picked from commit c550e681d7)
2023-04-27 19:16:58 +03:00
Nadav Har'El
00a8c3a433 test/alternator: increase CQL connection timeout
This patch increases the connection timeout in the get_cql_cluster()
function in test/cql-pytest/run.py. This function is used to test
that Scylla came up, and also test/alternator/run uses it to set
up the authentication - which can only be done through CQL.

The Python driver has 2-second and 5-second default timeouts that should
have been more than enough for everybody (TM), but in #13239 we saw
that in one case it apparently wasn't enough. So to be extra safe,
let's increase the default connection-related timeouts to 60 seconds.

Note this change only affects the Scylla *boot* in the test/*/run
scripts, and it does not affect the actual tests - those have different
code to connect to Scylla (see cql_session() in test/cql-pytest/util.py),
and we already increased the timeouts there in #11289.

Fixes #13239

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13291

(cherry picked from commit 4fdcee8415)
2023-04-27 19:15:39 +03:00
Tomasz Grabiec
c08ed39a33 direct_failure_detector: Avoid throwing exceptions in the success path
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if happens often.

Also, it spams the logs if --logger-log-level exception=trace is enabled.

Avoid by swallowing the exception on cancellation.

Fixes #13278.

Closes #13279

(cherry picked from commit 99cb948eac)
2023-04-27 19:14:31 +03:00
Kefu Chai
04424f8956 test: cql-pytest: test_describe: clamp bloom filter's fp rate
before this change, we use `round(random.random(), 5)` for
the value of `bloom_filter_fp_chance` config option. there are
chances that this expression could return a number lower or equal
to 6.71e-05.

but we do have a minimal for this option, which is defined by
`utils::bloom_calculations::probs`. and the minimal false positive
rate is 6.71e-05.

we are observing test failures where the we are using 0 for
the option, and scylla right rejected it with the error message of
```
bloom_filter_fp_chance must be larger than 6.71e-05 and less than or equal to 1.0 (got 0)
```.

so, in this change, to address the test failure, we always use a number
slightly greater or equal to a number slightly greater to the minimum to
ensure that the randomly picked number is in the range of supported
false positive rate.

Fixes #13313
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13314

(cherry picked from commit 33f4012eeb)
2023-04-27 19:12:53 +03:00
Beni Peled
429b696bbc release: prepare for 5.2.0 2023-04-27 16:26:43 +03:00
Beni Peled
a89867d8c2 release: prepare for 5.2.0-rc5 2023-04-25 14:37:54 +03:00
Benny Halevy
6ad94fedf3 utils: clear_gently: do not clear null unique_ptr
Otherwise the null pointer is dereferenced.

Add a unit test reproducing the issue
and testing this fix.

Fixes #13636

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 12877ad026)
2023-04-24 17:51:01 +03:00
Anna Stuchlik
a6188d6abc doc: document tombstone_gc as not experimental
The tombstone_gc was documented as experimental in version 5.0.
It is no longer experimental in version 5.2.
This commit updates the information about the option.

Closes #13469

(cherry picked from commit a68b976c91)
2023-04-24 11:54:06 +03:00
Botond Dénes
50095cc3a5 Merge 'db: system_keyspace: use microsecond resolution for group0_history range tombstone' from Kamil Braun
in `make_group0_history_state_id_mutation`, when adding a new entry to
the group 0 history table, if the parameter `gc_older_than` is engaged,
we create a range tombstone in the mutation which deletes entries older
than the new one by `gc_older_than`. In particular if
`gc_older_than = 0`, we want to delete all older entries.

There was a subtle bug there: we were using millisecond resolution when
generating the tombstone, while the provided state IDs used microsecond
resolution. On a super fast machine it could happen that we managed to
perform two schema changes in a single millisecond; this happened
sometimes in `group0_test.test_group0_history_clearing_old_entries`
on our new CI/promotion machines, causing the test to fail because the
tombstone didn't clear the entry correspodning to the previous schema
change when performing the next schema change (since they happened in
the same millisecond).

Use microsecond resolution to fix that. The consecutive state IDs used
in group 0 mutations are guaranteed to be strictly monotonic at
microsecond resolution (see `generate_group0_state_id` in
service/raft/raft_group0_client.cc).

Fixes #13594

Closes #13604

* github.com:scylladb/scylladb:
  db: system_keyspace: use microsecond resolution for group0_history range tombstone
  utils: UUID_gen: accept decimicroseconds in min_time_UUID

(cherry picked from commit 10c1f1dc80)
2023-04-23 16:03:02 +03:00
Botond Dénes
7b2215d8e0 Merge 'Backport bugfixes regarding UDT, UDF, UDA interactions to branch-5.2' from Wojciech Mitros
This patch backports https://github.com/scylladb/scylladb/pull/12710 to branch-5.2. To resolve the conflicts that it's causing, it also includes
* https://github.com/scylladb/scylladb/pull/12680
* https://github.com/scylladb/scylladb/pull/12681

Closes #13542

* github.com:scylladb/scylladb:
  uda: change the UDF used in a UDA if it's replaced
  functions: add helper same_signature method
  uda: return aggregate functions as shared pointers
  udf: also check reducefunc to confirm that a UDF is not used in a UDA
  udf: fix dropping UDFs that share names with other UDFs used in UDAs
  pytest: add optional argument for new_function argument types
  udt: disallow dropping a user type used in a user function
2023-04-19 01:38:08 -04:00
Botond Dénes
da9f90362d Merge 'Compaction reevaluation bug fixes' from Raphael "Raph" Carvalho
A problem in compaction reevaluation can cause the SSTable set to be left uncompacted for indefinite amount of time, potentially causing space and read amplification to be suboptimal.

Two revaluation problems are being fixed, one after off-strategy compaction ended, and another in compaction manager which intends to periodically reevaluate a need for compaction.

Fixes https://github.com/scylladb/scylladb/issues/13429.
Fixes https://github.com/scylladb/scylladb/issues/13430.

Closes #13431

* github.com:scylladb/scylladb:
  compaction: Make compaction reevaluation actually periodic
  replica: Reevaluate regular compaction on off-strategy completion

(cherry picked from commit 9a02315c6b)
2023-04-19 01:14:33 -04:00
Botond Dénes
c9a17c80f6 mutation/mutation_compactor: consume_partition_end(): reset _stop
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retreiving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partitions is interrputed
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engagned optional from `detach_state()`, because there is not state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, but overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_partition::no` and `detach_state()` will return a disengaged
optional as it should in this case.

Fixes: #12629

Closes #13365

(cherry picked from commit bae62f899d)
2023-04-18 02:32:24 -04:00
Wojciech Mitros
7242c42089 uda: change the UDF used in a UDA if it's replaced
Currently, if a UDA uses a UDF that's being replaced,
the UDA will still keep using the old UDF until the
node is restarted.
This patch fixes this behavior by checking all UDAs
when replacing a UDF and updating them if necessary.

Fixes #12709

(cherry picked from commit 02bfac0c66)
2023-04-17 13:14:46 +02:00
Wojciech Mitros
70ff69afab functions: add helper same_signature method
When deciding whether two functions have the same
signature, we have to check if they have the same name
and parameter types. Additionally, if they're represented
by pointers, we need to check if any of them is a nullptr.
This logic is used multiple times, so it's extracted to
a separate function.
To use this function, the `used_by_user_aggregate` method
takes now a function instead of name and types list - we
can do it because we always use it with an existing user
function (that we're trying to drop).
The method will also be useful when we'll be not dropping,
but replacing a user function.

(cherry picked from commit 58987215dc)
2023-04-17 13:14:40 +02:00
Wojciech Mitros
5fd4bb853b uda: return aggregate functions as shared pointers
We will want to reuse the functions that we get from an aggregate
without making a deep copy, and it's only possible if we get
pointers from the aggregate instead of actual values.

(cherry picked from commit 20069372e7)
2023-04-17 13:14:24 +02:00
Wojciech Mitros
313649e86d udf: also check reducefunc to confirm that a UDF is not used in a UDA
When dropping a UDF we're checking if it's not begin used in any UDAs
and fail otherwise. However, we're only checking its state function
and final function, and it may also be used as its reduce function.
This patch adds the missing checks and a test for them.

(cherry picked from commit ef1dac813b)
2023-04-17 13:14:16 +02:00
Wojciech Mitros
14d8cec130 udf: fix dropping UDFs that share names with other UDFs used in UDAs
Currently, when dropping a function, we only check if there exist
an aggregate that uses a function with the same name as its state
function or final function. This may cause the drop to fail even
when it's just another UDF with the same name that's used in the
aggregate, even when the actual dropped function is not used there.
This patch fixes this by checking whether not only the name of the
UDA's sfunc and finalfunc, but also their argument types.

(cherry picked from commit 49077dd144)
2023-04-17 13:14:09 +02:00
Wojciech Mitros
203cbb79a1 pytest: add optional argument for new_function argument types
When multiple functions with the same name but different argument types
are created, the default drop statement for these functions will fail
because it does not include the argument types.
With this change, this problem can be worked around by specifying
argument types when creating the function, as this will cause the drop
statement to include them.

(cherry picked from commit 8791b0faf5)
2023-04-17 13:13:59 +02:00
Wojciech Mitros
51f19d1b8c udt: disallow dropping a user type used in a user function
Currently, nothing prevents us from dropping a user type
used in a user function, even though doing so may make us
unable to use the function correctly.
This patch prevents this behavior by checking all function
argument and return types when executing a drop type statement
and preventing it from completing if the type is referenced
by any of them.

(cherry picked from commit 86c61828e6)
2023-04-17 13:13:35 +02:00
Anna Stuchlik
83735ae77f doc: update the metrics between 5.2 and 2023.1
Related: https://github.com/scylladb/scylla-enterprise/issues/2794

This commit adds the information about the metric changes
in version 2023.1 compared to version 5.2.

This commit is part of the 5.2-to-2023.1 upgrade guide and
must be backported to branch-5.2.

Closes #13506

(cherry picked from commit 989a75b2f7)
2023-04-17 11:29:43 +02:00
Avi Kivity
9d384e3af2 Merge 'Backport "reader_concurrency_semaphore: don't evict inactive readers needlessly" to branch-5.2' from Botond Dénes
The patch doesn't apply cleanly, so a targeted backport PR was necessary.
I also needed to cherry-pick two patches from https://github.com/scylladb/scylladb/pull/13255 that the backported patch depends on. Decided against backporting the entire https://github.com/scylladb/scylladb/pull/13255 as it is quite an intrusive change.

Fixes: https://github.com/scylladb/scylladb/issues/11803

Closes #13515

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: don't evict inactive readers needlessly
  reader_concurrency_semaphore: add stats to record reason for queueing permits
  reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
2023-04-17 12:25:21 +03:00
Nadav Har'El
0da0c94f49 cql: USING TTL 0 means unlimited, not default TTL
Our documentation states that writing an item with "USING TTL 0" means it
should never expire. This should be true even if the table has a default
TTL. But Scylla mistakenly handled "USING TTL 0" exactly like having no
USING TTL at all (i.e., it took the default TTL, instead of unlimited).
We had two xfailing tests demonstrating that Scylla's behavior in this
is different from Cassandra. Scylla's behavior in this case was also
undocumented.

By the way, Cassandra used to have the same bug (CASSANDRA-11207) but
it was fixed already in 2016 (Cassandra 3.6).

So in this patch we fix Scylla's "USING TTL 0" behavior to match the
documentation and Cassandra's behavior since 2016. One xfailing test
starts to pass and the second test passes this bug and fails on a
different one. This patch also adds a third test for "USING TTL ?"
with UNSET_VALUE - it behaves, on both Scylla and Cassandra, like a
missing "USING TTL".

The origin of this bug was that after parsing the statement, we saved
the USING TTL in an integer, and used 0 for the case of no USING TTL
given. This meant that we couldn't tell if we have USING TTL 0 or
no USING TTL at all. This patch uses an std::optional so we can tell
the case of a missing USING TTL from the case of USING TTL 0.

Fixes #6447

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13079

(cherry picked from commit a4a318f394)
2023-04-17 10:41:08 +03:00
Nadav Har'El
1a9f51b767 cql: fix empty aggregation, and add more tests
This patch fixes #12475, where an aggregation (e.g., COUNT(*), MIN(v))
of absolutely no partitions (e.g., "WHERE p = null" or "WHERE p in ()")
resulted in an internal error instead of the "zero" result that each
aggregator expects (e.g., 0 for COUNT, null for MIN).

The problem is that normally our aggregator forwarder picks the nodes
which hold the relevant partition(s), forwards the request to each of
them, and then combines these results. When there are no partitions,
the query is sent to no node, and we end up with an empty result set
instead of the "zero" results. So in this patch we recognize this
case and build those "zero" results (as mentioned above, these aren't
always 0 and depend on the aggregation function!).

The patch also adds two tests reproducing this issue in a fairly general
way (e.g., several aggregators, different aggregation functions) and
confirming the patch fixes the bug.

The test also includes two additional tests for COUNT aggregation, which
uncovered an incompatibility with Cassandra which is still not fixed -
so these tests are marked "xfail":

Refs #12477: Combining COUNT with GROUP by results with empty results
             in Cassandra, and one result with empty count in Scylla.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12715

(cherry picked from commit 3ba011c2be)
2023-04-17 10:41:08 +03:00
Raphael S. Carvalho
dba0e604a7 table: Fix disk-space related metrics
total disk space used metric is incorrectly telling the amount of
disk space ever used, which is wrong. It should tell the size of
all sstables being used + the ones waiting to be deleted.
live disk space used, by this defition, shouldn't account the
ones waiting to be deleted.
and live sstable count, shouldn't account sstables waiting to
be deleted.

Fix all that.

Fixes #12717.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 529a1239a9)
2023-04-16 22:14:01 +03:00
Michał Chojnowski
4ea67940cb locator: token_metadata: get rid of a quadratic behaviour in get_address_ranges()
Some callees of update_pending_ranges use the variant of get_address_ranges()
which builds a hashmap of all <endpoint, owned range> pairs. For
everywhere_topology, the size of this map is quadratic in the number of
endpoints, making it big enough to cause contiguous allocations of tens of MiB
for clusters of realistic size, potentially causing trouble for the
allocator (as seen e.g. in #12724). This deserves a correction.

This patch removes the quadratic variant of get_address_ranges() and replaces
its uses with its linear counterpart.

Refs #10337
Refs #10817
Refs #10836
Refs #10837
Fixes #12724

(cherry picked from commit 9e57b21e0c)
2023-04-16 21:59:14 +03:00
Jan Ciolek
a8c49c44e5 cql/query_options: add a check for missing bind marker name
There was a missing check in validation of named
bind markers.

Let's say that a user prepares a query like:
```cql
INSERT INTO ks.tab (pk, ck, v) VALUES (:pk, :ck, :v)
```
Then they execute the query, but specify only
values for `:pk` and `:ck`.

We should detect that a value for :v is missing
and throw an invalid_request_exception.

Until now there was no such check, in case of a missing variable
invalid `query_options` were created and Scylla crashed.

Sadly it's impossible to create a regression test
using `cql-pytest` or `boost`.

`cql-pytest` uses the python driver, which silently
ignores mising named bind variables, deciding
that the user meant to send an UNSET_VALUE for them.
When given values like `{'pk': 1, 'ck': 2}`, it will automaticaly
extend them to `{'pk': 1, 'ck': 2, 'v': UNSET_VALUE}`.

In `boost` I tried to use `cql_test_env`,
but it only has methods which take valid `query_options`
as a parameter. I could create a separate unit tests
for the creation and validation of `query_options`
but it won't be a true end-to-end test like `cql-pytest`.

The bug was found using the rust driver,
the reproducer is available in the issue description.

Fixes: #12727

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #12730

(cherry picked from commit 2a5ed115ca)
2023-04-16 21:57:28 +03:00
Nadav Har'El
12a29edf90 test/alternator: fix flaky test for partition-tombstone scan
The test test_scan.py::test_scan_long_partition_tombstone_string
checks that a full-table Scan operation ends a page in the middle of
a very long string of partition tombstones, and does NOT scan the
entire table in one page (if we did that, getting a single page could
take an unbounded amount of time).

The test is currently flaky, having failed in CI runs three times in
the past two months.

The reason for the flakiness is that we don't know exactly how long
we need to make the sequence of partition tombstones in the test before
we can be absolutely sure a single page will not read this entire sequence.
For single-partition scans we have the "query_tombstone_page_limit"
configuration parameter, which tells us exactly how long we need to
make the sequence of row tombstones. But for a full-table scan of
partition tombstones, the situation is more complicated - because the
scan is done in parallel on several vnodes in parallel and each of
them needs to read query_tombstone_page_limit before it stops.

In my experiments, using query_tombstone_limit * 4 consecutive tombstones
was always enough - I ran this test hundreds of times and it didn't fail
once. But since it did fail on Jenkins very rarely (3 times in the last
two months), maybe the multiplier 4 isn't enough. So this patch doubles
it to 8. Hopefully this would be enough for anyone (TM).

This makes this test even bigger and slower than it was. To make it
faster, I changed this test's write isolation mode from the default
always_use_lwt to forbid_rmw (not use LWT). This leaves the test's
total run time to be similar to what it was before this patch - around
0.5 seconds in dev build mode on my laptop.

Fixes #12817

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12819

(cherry picked from commit 14cdd034ee)
2023-04-14 11:54:45 +03:00
Botond Dénes
3e10c3fc89 reader_concurrency_semaphore: don't evict inactive readers needlessly
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated end requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-agressive eviction and
checking that it doesn't happen anymore.

Fixes: #11803

Closes #13286

(cherry picked from commit bd57471e54)
2023-04-14 10:37:30 +03:00
Botond Dénes
f11deb5074 reader_concurrency_semaphore: add stats to record reason for queueing permits
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in a new stats, one for each reason a permit
can be queued.

(cherry picked from commit 7b701ac52e)
2023-04-14 10:37:30 +03:00
Botond Dénes
1baf9dddd7 reader_concurrency_semaphore: can_admit_read(): also return reason for rejection
So caller can bump the appropriate counters or log the reason why the
the request cannot be admitted.

(cherry picked from commit bb00405818)
2023-04-14 09:30:02 +03:00
Kamil Braun
9717ff5057 docs: cleaning up after failed membership change
After a failed topology operation, like bootstrap / decommission /
removenode, the cluster might contain a garbage entry in either token
ring or group 0. This entry can be cleaned-up by executing removenode on
any other node, pointing to the node that failed to bootstrap or leave
the cluster.

Document this procedure, including a method of finding the host ID of a
garbage entry.

Add references in other documents.

Fixes: #13122

Closes #13186

(cherry picked from commit c2a2996c2b)
2023-04-13 10:35:02 +02:00
Anna Stuchlik
b293b1446f doc: remove Enterprise upgrade guides from OSS doc
This commit removes the Enterprise upgrade guides from
the Open Source documentation. The Enterprise upgrade guides
should only be available in the Enterprise documentation,
with the source files stored in scylla-enterprise.git.

In addition, this commit:
- adds the links to the Enterprise user guides in the Enterprise
documentation at https://enterprise.docs.scylladb.com/
- adds the redirections for the removed pages to avoid
breaking any links.

This commit must be reverted in scylla-enterprise.git.

(cherry picked from commit 61bc05ae49)

Closes #13473
2023-04-11 14:26:35 +03:00
Yaron Kaikov
e6f7ac17f6 doc: update supported os for 2022.1
ubuntu22.04 is already supported on both `5.0` and `2022.1`

updating the table

Closes #13340

(cherry picked from commit c80ab78741)
2023-04-05 13:56:07 +03:00
Anna Stuchlik
36619fc7d9 doc: add upgrade guide from 5.2 to 2023.1
Related: https://github.com/scylladb/scylla-enterprise/issues/2770

This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to ScyllaDB Enterprise 2023.1.
This commit does not cover metric updates (the metrics file has no
content, which needs to be added in another PR).

As this is an upgrade guide, this commit must be merged to master and
backported to branch-5.2 and branch-2023.1 in scylla-enterprise.git.

Closes #13294

(cherry picked from commit 595325c11b)
2023-04-05 06:43:01 +03:00
Anna Stuchlik
750414c196 doc: update Raft doc for versions 5.2 and 2023.1
Fixes https://github.com/scylladb/scylladb/issues/13345
Fixes https://github.com/scylladb/scylladb/issues/13421

This commit updates the Raft documentation page to be up to date in versions 5.2 and 2023.1.

- Irrelevant information about previous releases is removed.
- Some information is clarified.
- Mentions of version 5.2 are either removed (if possible) or version 2023.1 is added.

Closes #13426

(cherry picked from commit 447ce58da5)
2023-04-05 06:42:28 +03:00
Botond Dénes
128050e984 Merge 'commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off' from Calle Wilund
Fixes #12810

We did not update total_size_on_disk in commitlog totals when use o_dsync was off.
This means we essentially ran with no registered footprint, also causing broken comparisons in delete_segments.

Closes #12950

* github.com:scylladb/scylladb:
  commitlog: Fix updating of total_size_on_disk on segment alloc when o_dsync is off
  commitlog: change type of stored size

(cherry picked from commit e70be47276)
2023-04-03 08:57:43 +03:00
Yaron Kaikov
d70751fee3 release: prepare for 5.2.0-rc4 2023-04-02 16:40:56 +03:00
Tzach Livyatan
1fba43c317 docs: minor improvments to the Raft Handling Failures and recovery procedure sections
Closes #13292

(cherry picked from commit 46e6c639d9)
2023-03-31 11:22:20 +02:00
Botond Dénes
e380c24c69 Merge 'Improve database shutdown verbosity' from Pavel Emelyanov
The `database::stop` method is sometimes hanging and it's always hard to spot where exactly it sleeps. Few more logging messages would make this much simpler.

refs: #13100
refs: #10941

Closes #13141

* github.com:scylladb/scylladb:
  database: Increase verbosity of database::stop() method
  large_data_handler: Increase verbosity on shutdown
  large_data_handler: Coroutinize .stop() method

(cherry picked from commit e22b27a107)
2023-03-30 17:01:24 +03:00
Avi Kivity
76a76a95f4 Update tools/java submodule (hdrhistogram with Java 11)
* tools/java 1c4e1e7a7d...83b2168b19 (1):
  > Fix cassandra-stress -log hdrfile=... with java 11

Fixes #13287
2023-03-29 14:10:27 +03:00
Anna Stuchlik
f6837afec7 doc: update the Ubuntu version used in the image
Starting from 5.2 and 2023.1 our images are based on Ubuntu:22.04.
See https://github.com/scylladb/scylladb/issues/13138#issuecomment-1467737084

This commit adds that information to the docs.
It should be merged and backported to branch-5.2.

Closes #13301

(cherry picked from commit 9e27f6b4b7)
2023-03-27 14:08:57 +03:00
Botond Dénes
6350c8836d Revert "repair: Reduce repair reader eviction with diff shard count"
This reverts commit c6087cf3a0.

Said commit can cause a deadlock when 2 or more repairs compete for
locks on 2 or more nodes. Consider the following scenario:

Node n1 and n2 in the cluster, 1 shard per node, rf = 2, each shard has
1 available unit for the reader lock

    n1 starts repair r1
    r1-n1 (instance of r1 on node1) takes the reader lock on node1
    n2 starts repair r2
    r2-n2 (instance of r2 on node2) takes the reader lock on node2
    r1-n2 will fail to take the reader lock on node2
    r2-n1 will fail to take the reader lock on node1

As a result, r1 and r2 could not make progress and deadlock happens.

The complexity comes from the fact that a repair job needs lock on more
than one node. It is not guaranteed that all the participant nodes could
take the lock in one short.

There is no simple solution to this so we have to revert this locking
mechanism and look for another way to prevent reader trashing when
repairing nodes with mismatching shard count.

Fixes: #12693

Closes #13266

(cherry picked from commit 7699904c54)
2023-03-24 09:44:16 +02:00
Avi Kivity
5457948437 Update seastar submodule (rpc cancellation during negotiation)
* seastar 8889cbc198...1488aaf842 (1):
  > Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov

Fixes #11507.
2023-03-23 17:15:00 +02:00
Avi Kivity
da41001b5c .gitmodules: point seastar submodule at scylla-seastar.git
This allows is to backport seastar commits.

Ref #11507.
2023-03-23 17:11:43 +02:00
Anna Stuchlik
dd61e8634c doc: related https://github.com/scylladb/scylladb/issues/12754; add the missing information about reporting latencies to the upgrade guide 5.1 to 5.2
Closes #12935

(cherry picked from commit 26bb36cdf5)
2023-03-22 10:38:28 +02:00
Anna Stuchlik
b642b4c30e doc: fix the service name in upgrade guides
Fixes https://github.com/scylladb/scylladb/issues/13207

This commit fixes the service and package names in
the upgrade guides 5.0-to-2022.1 and 5.1-to-2022.2.
Service name: scylla-server
Package name: scylla-enterprise

Previous PRs to fix the same issue in other
upgrade guides:
https://github.com/scylladb/scylladb/pull/12679
https://github.com/scylladb/scylladb/pull/12698

This commit must be backported to branch-5.1 and branch 5.2.

Closes #13225

(cherry picked from commit 922f6ba3dd)
2023-03-22 10:37:12 +02:00
Botond Dénes
c013336121 db/view/view_update_check: check_needs_view_update_path(): filter out non-member hosts
We currently don't clean up the system_distributed.view_build_status
table after removed nodes. This can cause false-positive check for
whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, it even when done, it might not be immediate. So until then
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.

Fixes: #11905
Refs: #11836

Closes #11860

(cherry picked from commit 84a69b6adb)
2023-03-22 09:03:50 +02:00
Kamil Braun
b6b35ce061 service: storage_proxy: sequence CDC preimage select with Paxos learn
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.

Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.

`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.

Fixes #12098

(cherry picked from commit 1ef113691a)
2023-03-21 20:23:19 +02:00
Petr Gusev
069e38f02d transport server: fix unexpected server errors handling
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was timeout error.

A new test_batch_with_error is added. It is quite
difficult to reproduce error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.

Closes: scylladb#12104
(cherry picked from commit a4cf509c3d)
2023-03-21 20:23:09 +02:00
Anna Mikhlin
61a8003ad1 release: prepare for 5.2.0-rc3 2023-03-20 10:10:27 +02:00
Botond Dénes
8a17066961 Merge 'doc: Updates the recommended OS to be Ubuntu 22.04' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/13138
Fixes https://github.com/scylladb/scylladb/issues/13153

This PR:

- Fixes outdated information about the recommended OS. Since version 5.2, the recommended OS should be Ubuntu 22.04 because that OS is used for building the ScyllaDB image.
- Adds the OS support information for version 5.2.

This PR (both commits) needs to be backported to branch-5.2.

Closes #13188

* github.com:scylladb/scylladb:
  doc: Add OS support for version 5.2
  doc: Updates the recommended OS to be Ubuntu 22.04

(cherry picked from commit f4b5679804)
2023-03-17 10:30:06 +02:00
Pavel Emelyanov
487ba9f3e1 Merge '[backport] reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()' from Botond Dénes
This PR backports 2f4a793457 to branch-5.2. Said patch depends on some other patches that are not part of any release yet.
This PR should apply to 5.1 and 5.0 too.

Closes #13162

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
  reader_permit: expose operator<<(reader_permit::state)
  reader_permit: add get_state() accessor
2023-03-16 18:41:08 +03:00
Botond Dénes
bd4f9e3615 Merge 'readers/nonforwarding: don't emit partition_end on next_partition,fast_forward_to' from Gusev Petr
The series fixes the `make_nonforwardable` reader, it shouldn't emit `partition_end` for previous partition after `next_partition()` and `fast_forward_to()`

Fixes: #12249

Closes #12978

* github.com:scylladb/scylladb:
  flat_mutation_reader_test: cleanup, seastar::async -> SEASTAR_THREAD_TEST_CASE
  make_nonforwardable: test through run_mutation_source_tests
  make_nonforwardable: next_partition and fast_forward_to when single_partition is true
  make_forwardable: fix next_partition
  flat_mutation_reader_v2: drop forward_buffer_to
  nonforwardable reader: fix indentation
  nonforwardable reader: refactor, extract reset_partition
  nonforwardable reader: add more tests
  nonforwardable reader: no partition_end after fast_forward_to()
  nonforwardable reader: no partition_end after next_partition()
  nonforwardable reader: no partition_end for empty reader
  row_cache: pass partition_start though nonforwardable reader

(cherry picked from commit 46efdfa1a1)
2023-03-16 10:42:03 +02:00
Botond Dénes
c68deb2461 reader_concurrency_semaphore:: clear_inactive_reads(): defer evicting to evict()
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in severeal ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)

The list goes on. We already have an evict() method that all this
correctly, use that instead of the current badly open-coded alternative.

This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.

Fixes: #13048

Closes #13049

(cherry picked from commit 2f4a793457)
2023-03-14 09:50:16 +02:00
Botond Dénes
dd96d3017a reader_permit: expose operator<<(reader_permit::state)
(cherry picked from commit ec1c615029)
2023-03-14 09:50:16 +02:00
Botond Dénes
6ca80ee118 reader_permit: add get_state() accessor
(cherry picked from commit 397266f420)
2023-03-14 09:40:11 +02:00
Jan Ciolek
eee8f750cc cql3: preserve binary_operator.order in search_and_replace
There was a bug in `expr::search_and_replace`.
It doesn't preserve the `order` field of binary_operator.

`order` field is used to mark relations created
using the SCYLLA_CLUSTERING_BOUND.
It is a CQL feature used for internal queries inside Scylla.
It means that we should handle the restriction as a raw
clustering bound, not as an expression in the CQL language.

Losing the SCYLLA_CLUSTERING_BOUND marker could cause issues,
the database could end up selecting the wrong clustering ranges.

Fixes: #13055

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13056

(cherry picked from commit aa604bd935)
2023-03-09 12:52:39 +02:00
Botond Dénes
8d5206e6c6 sstables/sstable: validate_checksums(): force-check EOF
EOF is only guarateed to be set if one tried to read past the end of the
file. So when checking for EOF, also try to read some more. This
should force the EOF flag into a correct value. We can then check that
the read yielded 0 bytes.
This should ensure that `validate_checksums()` will not falsely declare
the validation to have failed.

Fixes: #11190

Closes #12696

(cherry picked from commit 693c22595a)
2023-03-09 12:30:44 +02:00
Anna Stuchlik
cfa40402f4 doc: Update the documentation landing page
This commit makes the following changes to the docs landing page:

- Adds the ScyllaDB enterprise docs as one of three tiles.

- Modifies the three tiles to reflect the three flavors of ScyllaDB.

- Moves the "New to ScyllaDB? Start here!" under the page title.

- Renames "Our Products" to "Other Products" to list the products other
  than ScyllaDB itself. In addtition, the boxes are enlarged from to
  large-4 to look better.

The major purpose of this commit is to expose the ScyllaDB
documentation.

docs: fix the link
(cherry picked from commit 27bb8c2302)

Closes #13086
2023-03-06 14:18:15 +02:00
Botond Dénes
2d170e51cf Merge 'doc: specify the versions where Alternator TTL is no longer experimental' from Anna Stuchlik
This PR adds a note to the Alternator TTL section to specify in which Open Source and Enterprise versions the feature was promoted from experimental to non-experimental.

The challenge here is that OSS and Enterprise are (still) **documented together**, but they're **not in sync** in promoting the TTL feature: it's still experimental in 5.1 (released) but no longer experimental in 2022.2 (to be released soon).

We can take one of the following approaches:
a) Merge this PR with master and ask the 2022.2 users to refer to master.
b) Merge this PR with master and then backport to branch-5.1. If we choose this approach, it is necessary to backport https://github.com/scylladb/scylladb/pull/11997 beforehand to avoid conflicts.

I'd opt for a) because it makes more sense from the OSS perspective and helps us avoid mess and backporting.

Closes #12295

* github.com:scylladb/scylladb:
  doc: fix the version in the comment on removing the note
  doc: specify the versions where Alternator TTL is no longer experimental

(cherry picked from commit d5dee43be7)
2023-03-02 12:09:16 +02:00
Anna Stuchlik
860e79e4b1 doc: fixes https://github.com/scylladb/scylladb/issues/12954, adds the minimal version from which the 2021.1-to-2022.1 upgrade is supported for Ubuntu, Debian, and image
Closes #12974

(cherry picked from commit 91b611209f)
2023-02-28 13:02:05 +02:00
Anna Mikhlin
908a82bea0 release: prepare for 5.2.0-rc2 2023-02-28 10:13:06 +02:00
Gleb Natapov
39158f55d0 lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more then once
If on the first call the capture is destroyed the second call may crash.

Fixes: #12958

Message-Id: <Y/sks73Sb35F+PsC@scylladb.com>
(cherry picked from commit 1ce7ad1ee6)
2023-02-27 14:19:37 +02:00
Raphael S. Carvalho
22c1685b3d sstables: Temporarily disable loading of first and last position metadata
It's known that reading large cells in reverse cause large allocations.
Source: https://github.com/scylladb/scylladb/issues/11642

The loading is preliminary work for splitting large partitions into
fragments composing a run and then be able to later read such a run
in an efficiency way using the position metadata.

The splitting is not turned on yet, anywhere. Therefore, we can
temporarily disable the loading, as a way to avoid regressions in
stable versions. Large allocations can cause stalls due to foreground
memory eviction kicking in.
The default values for position metadata say that first and last
position include all clustering rows, but they aren't used anywhere
other than by sstable_run to determine if a run is disjoint at
clustering level, but given that no splitting is done yet, it
does not really matter.

Unit tests relying on position metadata were adjusted to enable
the loading, such that they can still pass.

Fixes #11642.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12979

(cherry picked from commit d73ffe7220)
2023-02-27 08:58:34 +02:00
Botond Dénes
9ba6fc73f1 mutation_compactor: only pass consumed range-tombstone-change to validator
Currently all consumed range tombstone changes are unconditionally
forwarded to the validator. Even if they are shadowed by a higher level
tombstone and/or purgable. This can result in a situation where a range
tombstone change was seen by the validator but not passed to the
consumer. The validator expects the range tombstone change to be closed
by end-of-partition but the end fragment won't come as the tombstone was
dropped, resulting in a false-positive validation failure.
Fix by only passing tombstones to the validator, that are actually
passed to the consumer too.

Fixes: #12575

Closes #12578

(cherry picked from commit e2c9cdb576)
2023-02-23 22:52:47 +02:00
Botond Dénes
f2e2c0127a types: unserialize_value for multiprecision_int,bool: don't read uninitialized memory
Check the first fragment before dereferencing it, the fragment might be
empty, in which case move to the next one.
Found by running range scan tests with random schema and random data.

Fixes: #12821
Fixes: #12823
Fixes: #12708

Closes #12824

(cherry picked from commit ef548e654d)
2023-02-23 22:38:03 +02:00
Gleb Natapov
363ea87f51 raft: abort applier fiber when a state machine aborts
After 5badf20c7a applier fiber does not
stop after it gets abort error from a state machine which may trigger an
assertion because previous batch is not applied. Fix it.

Fixes #12863

(cherry picked from commit 9bdef9158e)
2023-02-23 14:12:12 +02:00
Kefu Chai
c49fd6f176 tools/schema_loader: do not return ref to a local variable
we should never return a reference to local variable.
so in this change, a reference to a static variable is returned
instead. this should address following warning from Clang 17:

```
/home/kefu/dev/scylladb/tools/schema_loader.cc:146:16: error: returning reference to local temporary object [-Werror,-Wreturn-stack-address]
        return {};
               ^~
```

Fixes #12875
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12876

(cherry picked from commit 6eab8720c4)
2023-02-22 22:02:43 +02:00
Takuya ASADA
3114589a30 scylla_coredump_setup: fix coredump timeout settings
We currently configure only TimeoutStartSec, but probably it's not
enough to prevent coredump timeout, since TimeoutStartSec is maximum
waiting time for service startup, and there is another directive to
specify maximum service running time (RuntimeMaxSec).

To fix the problem, we should specify RunTimeMaxSec and TimeoutSec (it
configures both TimeoutStartSec and TimeoutStopSec).

Fixes #5430

Closes #12757

(cherry picked from commit bf27fdeaa2)
2023-02-19 21:13:36 +02:00
Anna Stuchlik
34f68a4c0f doc: related https://github.com/scylladb/scylladb/issues/12658, fix the service name in the upgrade guide from 2022.1 to 2022.2
Closes #12698

(cherry picked from commit 826f67a298)
2023-02-17 12:17:48 +02:00
Botond Dénes
b336e11f59 Merge 'doc: fix the service name from "scylla-enterprise-server" "to "scylla-server"' from Anna Stuchlik
Related https://github.com/scylladb/scylladb/issues/12658.

This issue fixes the bug in the upgrade guides for the released versions.

Closes #12679

* github.com:scylladb/scylladb:
  doc: fix the service name in the upgrade guide for patch releases versions 2022
  doc: fix the service name in the upgrade guide from 2021.1 to 2022.1

(cherry picked from commit 325246ab2a)
2023-02-17 12:16:52 +02:00
Anna Stuchlik
9ef73d7e36 doc: fixes https://github.com/scylladb/scylladb/issues/12754, document the metric update in 5.2
Closes #12891

(cherry picked from commit bcca706ff5)
2023-02-17 12:16:13 +02:00
Botond Dénes
8700a72b4c Merge 'Backport compaction-backlog-tracker fixes to branch-5.2' from Raphael "Raph" Carvalho
Both patches are important to fix inefficiencies when updating the backlog tracker, which can manifest as a reactor stall, on a special event like schema change.

No conflicts when backporting.

Regression since 1d9f53c881, which is present in branch 5.1 onwards.

Closes #12851

* github.com:scylladb/scylladb:
  compaction: Fix inefficiency when updating LCS backlog tracker
  table: Fix quadratic behavior when inserting sstables into tracker on schema change
2023-02-15 07:22:25 +02:00
Raphael S. Carvalho
886dd3e1d2 compaction: Fix inefficiency when updating LCS backlog tracker
LCS backlog tracker uses STCS tracker for L0. Turns out LCS tracker
is calling STCS tracker's replace_sstables() with empty arguments
even when higher levels (> 0) *only* had sstables replaced.
This unnecessary call to STCS tracker will cause it to recompute
the L0 backlog, yielding the same value as before.

As LCS has a fragment size of 0.16G on higher levels, we may be
updating the tracker multiple times during incremental compaction,
which operates on SSTables on higher levels.

Inefficiency is fixed by only updating the STCS tracker if any
L0 sstable is being added or removed from the table.

This may be fixing a quadratic behavior during boot or refresh,
as new sstables are loaded one by one.
Higher levels have a substantial higher number of sstables,
therefore updating STCS tracker only when level 0 changes, reduces
significantly the number of times L0 backlog is recomputed.

Refs #12499.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12676

(cherry picked from commit 1b2140e416)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-14 12:14:27 -03:00
Raphael S. Carvalho
f565f3de06 table: Fix quadratic behavior when inserting sstables into tracker on schema change
Each time backlog tracker is informed about a new or old sstable, it
will recompute the static part of backlog which complexity is
proportional to the total number of sstables.
On schema change, we're calling backlog_tracker::replace_sstables()
for each existing sstable, therefore it produces O(N ^ 2) complexity.

Fixes #12499.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12593

(cherry picked from commit 87ee547120)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-14 12:14:21 -03:00
Anna Stuchlik
76ff6d981c doc: related https://github.com/scylladb/scylladb/issues/12754, add the requirement to upgrade Monitoring to version 4.3
Closes #12784

(cherry picked from commit c7778dd30b)
2023-02-10 10:28:35 +02:00
Botond Dénes
f924f59055 Merge 'Backport test.py improvements to 5.2' from Kamil Braun
Backport the following improvements for test.py efficiency and user experience:
- https://github.com/scylladb/scylladb/pull/12542
- https://github.com/scylladb/scylladb/pull/12560
- https://github.com/scylladb/scylladb/pull/12564
- https://github.com/scylladb/scylladb/pull/12563
- https://github.com/scylladb/scylladb/pull/12588
- https://github.com/scylladb/scylladb/pull/12613
- https://github.com/scylladb/scylladb/pull/12569
- https://github.com/scylladb/scylladb/pull/12612
- https://github.com/scylladb/scylladb/pull/12549
- https://github.com/scylladb/scylladb/pull/12678

Fixes #12617

Closes #12770

* github.com:scylladb/scylladb:
  test/pylib: put UNIX-domain socket in /tmp
  Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun
  Merge 'test.py: manual cluster pool handling for Python suite' from Alecco
  Merge 'test.py: handle broken clusters for Python suite' from Alecco
  test/pylib: scylla_cluster: don't leak server if stopping it fails
  Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun
  test/pylib: scylla_cluster: return error details from test framework endpoints
  test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
  test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
  test: disable commitlog O_DSYNC, preallocation
2023-02-08 15:09:09 +02:00
Nadav Har'El
d5cef05810 test/pylib: put UNIX-domain socket in /tmp
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: In Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins run, the
test directory names are very long, and we sometimes go over this length
limit and the result is that test.py fails creating this socket.

In this patch we simply put the socket in /tmp instead of the test
directory. We only need to do this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).

Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.

Fixes #12622

Closes #12678

(cherry picked from commit 681a066923)
2023-02-07 17:12:14 +01:00
Nadav Har'El
e0f4e99e9b Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.

Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager` would start, because there was free
space in the pool.

This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.

Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.

The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).

Fixes #11757

Closes #12549

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
  test/pylib: pool: introduce `replace_dirty`
  test/pylib: pool: replace `steal` with `put(is_dirty=True)`

(cherry picked from commit 132af20057)
2023-02-07 17:08:17 +01:00
Kamil Braun
6795715011 Merge 'test.py: manual cluster pool handling for Python suite' from Alecco
From reviews of https://github.com/scylladb/scylladb/pull/12569, avoid
using `async with` and access the `Pool` of clusters with
`get()`/`put()`.

Closes #12612

* github.com:scylladb/scylladb:
  test.py: manual cluster handling for PythonSuite
  test.py: stop cluster if PythonSuite fails to start
  test.py: minor fix for failed PythonSuite test

(cherry picked from commit 5bc7f0732e)
2023-02-07 17:07:43 +01:00
Nadav Har'El
aa9e91c376 Merge 'test.py: handle broken clusters for Python suite' from Alecco
If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it.

Ignore exception re-raised by the context manager.

Fixes #12360

Closes #12569

* github.com:scylladb/scylladb:
  test.py: handle broken clusters for Python suite
  test.py: Pool discard method

(cherry picked from commit 54f174a1f4)
2023-02-07 17:07:36 +01:00
Kamil Braun
ddfb9ebab2 test/pylib: scylla_cluster: don't leak server if stopping it fails
`ScyllaCluster.server_stop` had this piece of code:
```
        server = self.running.pop(server_id)
        if gracefully:
            await server.stop_gracefully()
        else:
            await server.stop()
        self.stopped[server_id] = server
```

We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).

Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.

Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.

Closes #12613

(cherry picked from commit a0ff33e777)
2023-02-07 17:05:20 +01:00
Nadav Har'El
d58a3e4d16 Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.

Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.

Closes #12588

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: better logging for timeout on server startup
  test/pylib: scylla_cluster: use less expensive query to check for CQL availability

(cherry picked from commit ccc2c6b5dd)
2023-02-07 17:05:02 +01:00
Kamil Braun
2ebac52d2d test/pylib: scylla_cluster: return error details from test framework endpoints
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.

Do it by wrapping every handler into a catch clause that returns
the exception message.

Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)

Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```

After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```

Closes #12563

(cherry picked from commit 2f84e820fd)
2023-02-07 17:04:37 +01:00
Kamil Braun
b536614913 test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.

Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.

Closes #12564

(cherry picked from commit 3ed3966f13)
2023-02-07 17:04:19 +01:00
Kamil Braun
85df0fd2b1 test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
    def before_test(self, name) -> None:
        """Check that  the cluster is ready for a test. If
        there was a start error, throw it here - the server is
        running when it's added to the pool, which can't be attributed
        to any specific test, throwing it here would stop a specific
        test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.

Prevent this by marking the cluster as dirty the first time we rethrow
the exception.

Closes #12560

(cherry picked from commit 147dd73996)
2023-02-07 17:03:56 +01:00
Avi Kivity
cdf9fe7023 test: disable commitlog O_DSYNC, preallocation
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).

However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.

Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.

Closes #12542

(cherry picked from commit 9029b8dead)
2023-02-07 17:02:59 +01:00
Beni Peled
8ff4717fd0 release: prepare for 5.2.0-rc1 2023-02-06 22:13:53 +02:00
Kamil Braun
291b1f6e7f service/raft: raft_group0: prevent double abort
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.

Fixes: #12512

Closes #12514

(cherry picked from commit 54170749b8)
2023-02-05 18:31:50 +02:00
Kefu Chai
b2699743cc db: system_keyspace: take the reserved_memory into account
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.

after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.

Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12573

(cherry picked from commit 4a0134a097)
2023-02-05 18:30:05 +02:00
Botond Dénes
50ae73a4bd types: is_tuple(): handle reverse types
Currently reverse types match the default case (false), even though they
might be wrapping a tuple type. One user-visible effect of this is that
a schema, which has a reversed<frozen<UDT>> clustering key component,
will have this component incorrectly represented in the schema cql dump:
the UDT will loose the frozen attribute. When attempting to recreate
this schema based on the dump, it will fail as the only frozen UDTs are
allowed in primary key components.

Fixes: #12576

Closes #12579

(cherry picked from commit ebc100f74f)
2023-02-05 18:20:21 +02:00
Calle Wilund
c3dd4a2b87 alterator::streams: Sort tables in list_streams to ensure no duplicates
Fixes #12601 (maybe?)

Sort the set of tables on ID. This should ensure we never
generate duplicates in a paged listing here. Can obviously miss things if they
are added between paged calls and end up with a "smaller" UUID/ARN, but that
is to be expected.

(cherry picked from commit da8adb4d26)
2023-02-05 17:44:00 +02:00
Benny Halevy
0f9fe61d91 view: row_lock: lock_ck: find or construct row_lock under partition lock
Since we're potentially searching the row_lock in parallel to acquiring
the read_lock on the partition, we're racing with row_locker::unlock
that may erase the _row_locks entry for the same clustering key, since
there is no lock to protect it up until the partition lock has been
acquired and the lock_partition future is resolved.

This change moves the code to search for or allocate the row lock
_after_ the partition lock has been acquired to make sure we're
synchronously starting the read/write lock function on it, without
yielding, to prevent this use-after-free.

This adds an allocation for copying the clustering key in advance
even if a row_lock entry already exists, that wasn't needed before.
It only us slows down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.

This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.

Fixes #12632

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4b5e324ecb)
2023-02-05 17:22:31 +02:00
Anna Stuchlik
59d30ff241 docs: fixes https://github.com/scylladb/scylladb/issues/12654, update the links to the Download Center
Closes #12655

(cherry picked from commit 64cc4c8515)
2023-02-05 17:19:56 +02:00
Anna Stuchlik
fb82dff89e doc: fixes https://github.com/scylladb/scylladb/issues/12672, fix the redirects to the Cloud docs
Closes #12673

(cherry picked from commit 2be131da83)
2023-02-05 17:17:35 +02:00
Kefu Chai
b588b19620 cql3/selection: construct string_view using char* not size
before this change, we construct a sstring from a comma statement,
which evaluates to the return value of `name.size()`, but what we
expect is `sstring(const char*, size_t)`.

in this change

* instead of passing the size of the string_view,
  both its address and size are used
* `std::string_view` is constructed instead of sstring, for better
  performance, as we don't need to perform a deep copy

the issue is reported by GCC-13:

```
In file included from cql3/selection/selectable.cc:11:
cql3/selection/field_selector.hh:83:60: error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
        auto sname = sstring(reinterpret_cast<const char*>(name.begin(), name.size()));
                                                           ^~~~~~~~~~
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12666

(cherry picked from commit 186ceea009)

Fixes #12739.
2023-02-05 13:50:48 +02:00
Michał Chojnowski
608ef92a71 commitlog: fix total_size_on_disk accounting after segment file removal
Currently, segment file removal first calls `f.remove_file()` and
does `total_size_on_disk -= f.known_size()` later.
However, `remove_file()` resets `known_size` to 0, so in effect
the freed space in not accounted for.

`total_size_on_disk` is not just a metric. It is also responsible
for deciding whether a segment should be recycled -- it is recycled
only if `total_size_on_disk - known_size < max_disk_size`.
Therefore this bug has dire performance consequences:
if `total_size_on_disk - known_size` ever exceeds `max_disk_size`,
the recycling of commitlog segments will stop permanently, because
`total_size_on_disk - known_size` will never go back below
`max_disk_size` due to the accounting bug. All new segments from this
point will be allocated from scratch.

The bug was uncovered by a QA performance test. It isn't easy to trigger --
it took the test 7 hours of constant high load to step into it.
However, the fact that the effect is permanent, and degrades the
performance of the cluster silently, makes the bug potentially quite severe.

The bug can be easily spotted with Prometheus as infinitely rising
`commitlog_total_size_on_disk` on the affected shards.

Fixes #12645

Closes #12646

(cherry picked from commit fa7e904cd6)
2023-02-01 21:54:37 +02:00
Kamil Braun
d2732b2663 Merge 'Enable Raft by default in new clusters' from Kamil Braun
New clusters that use a fresh conf/scylla.yaml will have `consistent_cluster_management: true`, which will enable Raft, unless the user explicitly turns it off before booting the cluster.

People using existing yaml files will continue without Raft, unless consistent_cluster_management is explicitly requested during/after upgrade.

Also update the docs: cluster creation and node addition procedures.

Fixes #12572.

Closes #12585

* github.com:scylladb/scylladb:
  docs: mention `consistent_cluster_management` for creating cluster and adding node procedures
  conf: enable `consistent_cluster_management` by default

(cherry picked from commit 5c886e59de)
2023-01-26 12:21:55 +01:00
Anna Mikhlin
34ab98e1be release: prepare for 5.2.0-rc0 2023-01-18 14:54:36 +02:00
Tomasz Grabiec
563998b69a Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that's doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.

Fixes #11723.

Closes #12502

* github.com:scylladb/scylladb:
  test: test_topology: test for removing garbage group 0 members
  test/pylib: move some utility functions to util.py
  db: system_keyspace: add a virtual table with raft configuration
  db: system_keyspace: improve system.raft_snapshot_config schema
  service: storage_service: better error handling in `decommission`
  service: storage_service: fix indentation in removenode
  service: storage_service: make `removenode` work for group 0 members which are not token ring members
  service/raft: raft_group0: perform read_barrier in wait_for_raft
  service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
  test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
  service/raft: raft_group0: link to Raft docs where appropriate
  service/raft: raft_group0: more logging
  service/raft: raft_group0: separate function for checking and waiting for Raft
2023-01-17 21:23:15 +01:00
Kamil Braun
d134c458e5 test/pylib: increase timeout when waiting for cluster before test
Increase the timeout from default 5 minutes to 10 minutes.
Sent as a workaround for #12546 to unblock next promotions.

Closes #12547
2023-01-17 21:03:09 +02:00
Kamil Braun
4f1c317bdc test: test_raft_upgrade: stop servers gracefully in test_recovery_after_majority_loss
This test is frequently failing due to a timeout when we try to restart
one of the nodes. The shutdown procedure apparently hangs when we try to
stop the `hints_manager` service, e.g.:
```
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Stopped
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO  2023-01-13 03:22:56,997 [shard 0] hints_manager - Stopped
```
observe the 5 minute delay at the end.

There is a known issue about `hints_manager` stop hanging: #8079.

Now, for some reason, this is the only test case that is hitting this
issue. We don't completely understand why. There is one significant
difference between this test case and others: this is the only test case
which kills 2 (out of 3) servers in the cluster and then tries to
gracefully shutdown the last server. There's a hypothesis that the last
server gets stuck trying to send hints to the killed servers. We weren't
able to prove/falsify it yet. But if it's true, then this patch will:
- unblock next promotions,
- give us some important information when we see that the issue stops
  appearing.
In the patch we shutdown all servers gracefully instead of killing them,
like we do in the other test cases.

Closes #12548
2023-01-17 20:51:09 +02:00
Pavel Emelyanov
4f415413d2 raft: Fix non-existing state_machine::apply_entry in docs
The docs mention that method, but it doesn't exist. Instead, the
state_machine interface defines plain .apply() one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12541
2023-01-17 12:53:05 +01:00
Kamil Braun
5545547d07 test: test_topology: test for removing garbage group 0 members
Verify that `removenode` can remove group 0 members which are not token
ring members.
2023-01-17 12:28:00 +01:00
Kamil Braun
c959ec455a test/pylib: move some utility functions to util.py
They were used in test_raft_upgrade, but we want to use them in other
test files too.
2023-01-17 12:28:00 +01:00
Kamil Braun
a483915c62 db: system_keyspace: add a virtual table with raft configuration
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.

Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.

Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
2023-01-17 12:28:00 +01:00
Kamil Braun
2bfe85ce9b db: system_keyspace: improve system.raft_snapshot_config schema
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.

Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
2023-01-17 12:28:00 +01:00
Kamil Braun
c3ed82e5fb service: storage_service: better error handling in decommission
Improve the error handling in `decommission` in case `leave_group0`
fails, informing the user what they should do (i.e. call `removenode` to
get rid of the group 0 member), and allowing decommission to finish; it
does not make sense to let the node continue to run after it leaves the
token ring. (And I'm guessing it's also not safe. Or maybe impossible.)
2023-01-17 12:28:00 +01:00
Kamil Braun
beb0eee007 service: storage_service: fix indentation in removenode 2023-01-17 12:28:00 +01:00
Kamil Braun
aba33dd352 service: storage_service: make removenode work for group 0 members which are not token ring members
Due to failures we might end up in a situation where we have a group 0
member which is not a token ring member: a decommission/removenode
which failed after leaving/removing a node from the token ring but
before leaving / removing a node from group 0.

There was no way to get rid of such a group 0 member. A node that left
the token ring must not be allowed to run further (or it can cause data
loss, data resurrection and maybe other fun stuff), so we can't run
decommission a second time (even if we tried, it would just say that
"we're not a member of the token ring" and abort). And `removenode`
would also not work, because it proceeds only if the node requested to
be removed is a member of the token ring.

We modify `removenode` so it can run in this situation and remove the
group 0 member. The parts of `removenode` related to token ring
modification are now conditioned on whether the node was a member of the
token ring. The final `remove_from_group0` step is in its own branch. Some
minor refactors were necessary. Some log messages were also modified so
it's easier to understand which messages correspond the "token movement"
part of the procedure.

The `make_nonvoter` step happens only if token ring removal happens,
otherwise we can skip directly to `remove_from_group0`.

We also move `remove_from_group0` outside the "try...catch",
fixing #11723. The "node ops" part of the procedure is related strictly
to token ring movement, so it makes sense for `remove_from_group0` to
happen outside.

Indentation is broken in this commit for easier reviewability, fixed in
the following commit.

Fixes: #11723
2023-01-17 12:28:00 +01:00
Kamil Braun
ec2cd29e42 service/raft: raft_group0: perform read_barrier in wait_for_raft
Right now wait_for_raft is called before performing group 0
configuration changes. We want to also call it before checking for
membership, for that it's desirable to have the most recent information,
hence call read_barrier. In the existing use cases it's not strictly
necessary, but it doesn't hurt.
2023-01-17 12:28:00 +01:00
Kamil Braun
db734cd74f service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
removenode currently works roughly like this:
1. stream/repair data so it ends up on new replica sets (calculated
   without the node we want to remove)
2. remove the node from the token ring
3. remove the node from group 0 configuration.

If the procedure fails before after step 2 but before step 3 finishes,
we're in trouble: the cluster is left with an additional voting group 0
member, which reduces group 0's availability, and there is no way to
remove this member because `removenode` no longer considers it to be
part of the cluster (it consults the token ring to decide).

Improve this failure scenario by including a new step at the beginning:
make the node a non-voter in group 0 configuration. Then, even if we
fail after removing the node from the token ring but before removing it
from group 0, we'll only be left with a non-voter which doesn't reduce
availability.

We make a similar change for `decommission`: between `unbootstrap()` (which
streams data) and `leave_ring()` (which removes our tokens from the
ring), become a non-voter. The difference here is that we don't become a
non-voter at the beginning, but only after streaming/repair. In
`removenode` it's desirable to make the node a non-voter as soon as
possible because it's already dead. In decommission it may be desirable
for us to remain a voter if we fail during streaming because we're still
alive and functional in that case.

In a later commit we'll also make it possible to retry `removenode` to
remove a node that is only a group 0 member and not a token ring member.
2023-01-17 12:28:00 +01:00
Kamil Braun
1eee349a17 test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
The test would create a scenario where one node was down while the others
started the Raft upgrade procedure. The procedure would get stuck, but
it was possible to `removenode` the downed node using one of the alive
nodes, which would unblock the Raft upgrade procedure.

This worked because:
1. the upgrade procedure starts by ensuring that all peers can be
   contacted,
2. `removenode` starts by removing the node from the token ring.

After removing the node from the token ring, the upgrade procedure
becomes able to contact all peers (the peers set no longer contains the
down node). At the end, after removing the node from the token ring,
`removenode` would actually get stuck for a while, waiting for the
upgrade procedure to finish before removing the peer from group 0.
After the upgrade procedure finished, `removenode` would also finish.
(so: first the upgrade procedure waited for removenode, then removenode
waited for the upgrade procedure).

We want to modify the `removenode` procedure and include a new step
before removing the node from the token ring: making the node a
non-voter. The purpose is to improve the possible failure scenarios.
Previously, if the `removenode` procedure failed after removing the node
from the token ring but before removing it from group 0, the cluster
would contain a 'garbage' group 0 member which is a voter - reducing
group 0's availability. If the node is made a non-voter first, then this
failure will not be as big of a problem, because the leftover group 0
member will be a non-voter.

However, to correctly perform group 0 operations including making
someone a nonvoter, we must first wait for the Raft upgrade procedure to
finish (or at least wait until everyone joins group 0). Therefore by
including this 'make the node a non-voter' step at the beginning of
`removenode`, we make it impossible to remove a token ring member in the
middle of the upgrade procedure, on which the test case relied. The test
case would get stuck waiting for the `removenode` operation to finish,
which would never finish because it would wait for the upgrade procedure
to finish, which would not finish because of the dead peer.

We remove the test case; it was "lucky" to pass in the first place. We
have a dedicated mechanism for handling dead peers during Raft upgrade
procedure: the manual Raft group 0 RECOVERY procedure. There are other
test cases in this file which are using that procedure.
2023-01-17 12:28:00 +01:00
Kamil Braun
4f0801406e service/raft: raft_group0: link to Raft docs where appropriate
Resolve some TODOs.
2023-01-17 12:28:00 +01:00
Kamil Braun
2befbaa341 service/raft: raft_group0: more logging
Make the logs in leave_group0 consistent with logs in
remove_from_group0.
2023-01-17 12:28:00 +01:00
Kamil Braun
77dc1c4c70 service/raft: raft_group0: separate function for checking and waiting for Raft
leave_group0 and remove_from_group0 functions both start with the
following steps:
- if Raft is disabled or in RECOVERY mode, print a simple log message
  and abort
- if Raft cluster feature flag is not yet enabled, print a complex log
  message and abort
- wait for Raft upgrade procedure to finish
- then perform the actual group 0 reconfiguration.

Refactor these preparation steps to a separate function,
`wait_for_raft`. This reduces code duplication; the function will also
be used in more operations later (becoming a nonvoter or turning another
server into a nonvoter).

We also change the API so that the preparation function is called from
outside by the caller before they call the reconfiguration function.
This is because in later commits, some of the call sites (mainly
`removenode`) will want to check explicitly whether Raft is enabled and
wait for Raft's availabilty, then perform a sequence of steps related
to group 0 configuration depending on the result.

Also add a private function `raft_upgrade_complete()` which we use to
assert that Raft is ready to be used.
2023-01-17 12:27:58 +01:00
Wojciech Mitros
5f45b32bfa forward_service: prevent heap use-after-free of forward_aggregates
Currently, we create `forward_aggregates` inside a function that
returns the result of a future lambda that captures these aggregates
by reference. As a result, the aggregates may be destructed before
the lambda finishes, resulting in a heap use-after-free.

To prolong the lifetime of these aggregates, we cannot use a move
capture, because the lambda is wrapped in a with_thread_if_needed()
call on these aggregates. Instead, we fix this by wrapping the
entire return statement in a do_with().

Fixes #12528

Closes #12533
2023-01-17 13:25:57 +02:00
Gleb Natapov' via ScyllaDB development
15ebd59071 lwt: upgrade stored mutations to the latest schema during prepare
Currently they are upgraded during learn on a replica. The are two
problems with this.  First the column mapping may not exist on a replica
if it missed this particular schema (because it was down for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second lwt
request coordinator may not have the schema for the mutation as well
(because it was freed from the registry already) and when a replica
tries to retrieve the schema from the coordinator the retrieval will fail
causing the whole request to fail with "Schema version XXXX not found"

Both of those problems can be fixed by upgrading stored mutations
during prepare on a node it is stored at. To upgrade the mutation its
column mapping is needed and it is guarantied that it will be present
at the node the mutation is stored at since it is pre-request to store
it that the corresponded schema is available. After that the mutation
is processed using latest schema that will be available on all nodes.

Fixes #10770

Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
2023-01-17 11:14:46 +01:00
Raphael S. Carvalho
f2f839b9cc compaction: LCS: don't reshape all levels if only a single breaks disjointness
LCS reshape is compacting all levels if a single one breaks
disjointness. That's unnecessary work because rewriting that single
level is enough to restore disjointness. If multiple levels break
disjointness, they'll each be reshaped in its own iteration, so
reducing operation time for each step and disk space requirement,
as input files can be released incrementally.
Incremental compaction is not applied to reshape yet, so we need to
avoid "major compaction", to avoid the space overhead.
But space overhead is not the only problem, the inefficiency, when
deciding what to reshape when overlapping is detected, motivated
this patch.

Fixes #12495.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12496
2023-01-17 09:55:15 +02:00
Michał Chojnowski
9e17564c70 types: add some missing explicit instantiations
Some functions defined by a template in types.cc are used in other
translation units (via `cql3/untyped_result_set.hh`), but aren't
explicitly instantiated. Therefore their linking can fail, depending
on inlining decisions. (I experienced this when playing with compiler
options).
Fix that.

Closes #12539
2023-01-17 10:46:01 +02:00
Nadav Har'El
5bf94ae220 cql: allow disabling of USING TIMESTAMP sanity checking
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).

This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).

So in this patch we add a new configuration option, restrict_future_timestamp
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
sinced #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, like is allowed Cassanda (and was allowed in
Scylla prior to #5619.

The error message in the case where a futuristic timestamp is rejected
now mentions the configuration paramter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).

This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.

Fixes #12527

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12537
2023-01-16 23:18:56 +02:00
Kefu Chai
114f30016a main: use std::shift_left() to consume tool name
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12536
2023-01-16 21:01:34 +02:00
Nadav Har'El
feef3f9dda test/cql-pytest: test more than one restriction on same clustering column
Cassandra refuses a request with more than one relation to the same
clustering column, for example

    DELETE FROM tbl WHERE p = ? and c = ? AND c > ?

complains that

    c cannot be restricted by more than one relation if it includes an Equal

But it produces different error messages for different operators and
even order.

Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept these queries, we should be
sure we do the right thing: "WHERE c = 1 AND c > 2" should match
nothing, "WHERE c = 1 AND c > 0" should match the matches of c = 1,
and so on. This patch adds a test for verify that these requests indeed
yield correct results. The test is scylla_only because, as explained
above, Cassandra doesn't support these requests at all.

Refs #12472

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12498
2023-01-16 20:41:16 +02:00
Kefu Chai
86b451d45c SCYLLA-VERSION-GEN: remove unnecessary bashism
remove unnecessary bashism, so that this script can be interpreted
by a POSIX shell.

/bin/sh is specified in the shebang line. on debian derivatives,
/bin/sh is dash, which is POSIX compliant. but this script is
written in the bash dialect.

before this change, we could run into following build failure
when building the tree on Debian:

[7/904] ./SCYLLA-VERSION-GEN
./SCYLLA-VERSION-GEN: 37: [[: not found

after this change, the build is able to proceed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12530
2023-01-16 20:34:01 +02:00
Avi Kivity
0b418fa7cf cql3, transport, tests: remove "unset" from value type system
The CQL binary protocol introduced "unset" values in version 4
of the protocol. Unset values can be bound to variables, which
cause certain CQL fragments to be skipped. For example, the
fragment `SET a = :var` will not change the value of `a` if `:var`
is bound to an unset value.

Unsets, however, are very limited in where they can appear. They
can only appear at the top-level of an expression, and any computation
done with them is invalid. For example, `SET list_column = [3, :var]`
is invalid if `:var` is bound to unset.

This causes the code to be littered with checks for unset, and there
are plenty of tests dedicated to catching unsets. However, a simpler
way is possible - prevent the infiltration of unsets at the point of
entry (when evaluating a bind variable expression), and introduce
guards to check for the few cases where unsets are allowed.

This is what this long patch does. It performs the following:

(general)

1. unset is removed from the possible values of cql3::raw_value and
   cql3::raw_value_view.

(external->cql3)

2. query_options is fortified with a vector of booleans,
   unset_bind_variable_vector, where each boolean corresponds to a bind
   variable index and is true when it is unset.
3. To avoid churn, two compatiblity structs are introduced:
   cql3::raw_value{,_view}_vector_with_unset, which can be constructed
   from a std::vector<raw_value{,_view/}>, which is what most callers
   have. They can also be constructed with explicit unset vectors, for
   the few cases they are needed.

(cql3->variables)

4. query_options::get_value_at() now throws if the requested bind variable
   is unset. This replaces all the throwing checks in expression evaluation
   and statement execution, which are removed.
5. A new query_options::is_unset() is added for the users that can tolerate
   unset; though it is not used directly.
6. A new cql3::unset_operation_guard class guards against unsets. It accepts
   an expression, and can be queried whether an unset is present. Two
   conditions are checked: the expression must be a singleton bind
   variable, and at runtime it must be bound to an unset value.
7. The modification_statement operations are split into two, via two
   new subclasses of cql3::operation. cql3::operation_no_unset_support
   ignores unsets completely. cql3::operation_skip_if_unset checks if
   an operand is unset (luckily all operations have at most one operand that
   tolerates unset) and applies unset_operation_guard to it.
8. The various sites that accept expressions or operations are modified
   to check for should_skip_operation(). This are the loops around
   operations in update_statement and delete_statement, and the checks
   for unset in attributes (LIMIT and PER PARTITION LIMIT)

(tests)

9. Many unset tests are removed. It's now impossible to enter an
   unset value into the expression evaluation machinery (there's
   just no unset value), so it's impossible to test for it.
10. Other unset tests now have to be invoked via bind variables,
   since there's no way to create an unset cql3::expr::constant.
11. Many tests have their exception message match strings relaxed.
   Since unsets are now checked very early, we don't know the context
   where they happen. It would be possible to reintroduce it (by adding
   a format string parameter to cql3::unset_operation_guard), but it
   seems not to be worth the effort. Usage of unsets is rare, and it is
   explicit (at least with the Python driver, an unset cannot be
   introduced by ommission).

I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't
recognize unsets) with cql3::maybe_unset_value (that does), but that
caused huge amounts of churn, so I abandoned that in favor of the
current approach.

Closes #12517
2023-01-16 21:10:56 +02:00
Kamil Braun
7510144fba Merge 'Add replace-node-first-boot option' from Benny Halevy
Allow replacing a node given its Host ID rather than its ip address.

This series adds a replace_node_first_boot option to db/config
and makes use of it in storage_service.

The new option takes priority over the legacy replace_address* options.
When the latter are used, a deprecation warning is printed.

Documentation updated respectively.

And a cql unit_test is added.

Ref #12277

Closes #12316

* github.com:scylladb/scylladb:
  docs: document the new replace_node_first_boot option
  dist/docker: support --replace-node-first-boot
  db: config: describe replace_address* options as deprecated
  test: test_topology: test replace using host_id
  test: pylib: ServerInfo: add host_id
  storage_service: get rid of get_replace_address
  storage_service: is_replacing: rely directly on config options
  storage_service: pass replacement_info to run_replace_ops
  storage_service: pass replacement_info to booststrap
  storage_service: join_token_ring: reuse replacement_info.address
  storage_service: replacement_info: add replace address
  init: do not allow cfg.replace_node_first_boot of seed node
  db: config: add replace_node_first_boot option
2023-01-16 15:08:31 +01:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against other node `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work

Closes #12529
2023-01-16 12:08:13 +02:00
Botond Dénes
3d9ab1d9eb Merge 'Get recursive tasks' statuses with task manager api call' from Aleksandra Martyniuk
The PR adds an api call allowing to get the statuses of a given
task and all its descendants.

The parent-child tree is traversed in BFS order and the list of
statuses is returned to user.

Closes #12317

* github.com:scylladb/scylladb:
  test: add test checking recursive task status
  api: get task statuses recursively
  api: change retrieve_status signature
2023-01-16 11:44:50 +02:00
Tzach Livyatan
073f0f00c6 Add Scylla Summit 2023 in the top banner
Closes #12519
2023-01-16 08:05:20 +02:00
Avi Kivity
5a07641b95 Update python3 submodule (license file fix)
* tools/python3 548e860...279b6c1 (1):
  > create-relocatable-package: s/pyhton3-libs/python3-libs/
2023-01-15 17:59:27 +02:00
Benny Halevy
de3142e540 docs: document the new replace_node_first_boot option
And mention that replacing a node using the legacy
replace_addr* options is deprecated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:41:44 +02:00
Benny Halevy
d4f1563369 dist/docker: support --replace-node-first-boot
And mention that replace_address_first_boot is deprecated

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
1577aa8098 db: config: describe replace_address* options as deprecated
The replace_address options are still supported
But mention in their description that they are now deprecated
and the user should use replace_node_first_boot instead.

While at it fix a typo in ignore_dead_nodes_for_replace

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
90faeedb77 test: test_topology: test replace using host_id
Add test cases exercising the --replace-node-first-boot option
by replacing nodes using their host_id rather
than ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
7d0d9e28f1 test: pylib: ServerInfo: add host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:07 +02:00
Benny Halevy
db2b76beb5 storage_service: get rid of get_replace_address
It is unused now.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:29 +02:00
Benny Halevy
17f70e4619 storage_service: is_replacing: rely directly on config options
Rather than on get_replace_address, before we remove the latter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:29 +02:00
Benny Halevy
7282d58d11 storage_service: pass replacement_info to run_replace_ops
So it won't need to call get_replace_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:09 +02:00
Benny Halevy
08598e4f64 storage_service: pass replacement_info to booststrap
So it won't need to call get_replace_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
b863f7a75f storage_service: join_token_ring: reuse replacement_info.address
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
add2f209b8 storage_service: replacement_info: add replace address
Populate replacement_info.address in prepare_replacement_info
as a first step towards getting rid of get_replace_address().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
75c8a5addc init: do not allow cfg.replace_node_first_boot of seed node
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
32e79185d4 db: config: add replace_node_first_boot option
For replacing a node given its (now unique) Host ID.

The existing options for replace_address*
will be deprecated in the following patches
and eventually we will stop supporting them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Tomasz Grabiec
abc43f97c9 Merge 'Simplify some Raft tables' from Kamil Braun
Rename `system.raft_config` to `system.raft_snapshot_config` to make it clearer
what the table stores.

Remove the `my_server_id` partition key column from
`system.raft_snapshot_config` and a corresponding column from
`system.raft_snapshots` which would store the Raft server ID of the local node.
It's unnecessary, all servers running on a given node in different groups will
use the same ID - the Raft ID of the node which is equal to its Host ID. There
will be no multiple servers running in a single Raft group on the same node.

Closes #12513

* github.com:scylladb/scylladb:
  db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
  db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
2023-01-13 00:23:21 +01:00
Botond Dénes
4e41e7531c docs/dev/debugging.md: recommend open-coredump.sh for opening coredumps
Leave the guide for manual opening in though, the script might not work
in all cases.
Also update the version example, we changed how development versions
look like.

Closes #12511
2023-01-12 19:30:59 +02:00
Botond Dénes
ab8171ffd5 open-coredump.sh: handle dev versions
Like: 5.2.0~dev, which really means master. Don't try to checkout
branch-5.2 in this case, it doesn't exist yet, checkout master instead.

Closes #12510
2023-01-12 19:28:58 +02:00
Kamil Braun
be390285b6 db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
A single node will run a single Raft server in any given Raft group,
so this column is not necessary.
2023-01-12 16:48:50 +01:00
Kamil Braun
bed555d1e5 db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).

In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
2023-01-12 16:21:26 +01:00
Botond Dénes
f87e3993ef Merge 'configure.py: a bunch of clean-up changes' from Michał Chojnowski
The planned integration of cross-module optimizations in scylladb/scylladb-enterprise requires several changes to `configure.py`. To minimize the divergence between the `configure.py`s of both repositories, this series upstreams some of these changes to scylladb/scylladb.

The changes mostly remove dead code and fix some traps for the unaware.

Closes #12431

* github.com:scylladb/scylladb:
  configure.py: prevent deduplication of seastar compile options
  configure.py: rename clang_inline_threshold()
  configure.py: rework the seastar_cflags variable
  configure.py: hoist the pkg_config() call for seastar-testing.pc
  configure.py: unify the libs variable for tests and non-tests
  configure.py: fix indentation
  configure.py: remove a stale code path for .a artifacts
2023-01-12 16:40:02 +02:00
Wojciech Mitros
082bfea187 rust: use depfile and Cargo.lock to avoid building rust when unnecessary
Currently, we call cargo build every time we build scylla, even
when no rust files have been changed.
This is avoided by adding a depfile to the ninja rule for the rust
library.
The rust file is generated by default during cargo build,
but it uses the full paths of all depenencies that it includes,
and we use relative paths. This is fixed by specifying
CARGO_BUILD_DEP_INFO_BASEDIR='.', which makes it so the current
path is subtracted from all generated paths.
Instead of using 'always' when specifying when to run the cargo
build, a dependency on Cargo.lock is added additionally to the
depfile. As a result, the rust files are recompiled not only
when the source files included in the depfile are modified,
but also when some rust dependency is updated.
Cargo may put an old cached file as a result of the build even
when the Cargo.lock was recently updated. Because of that, the
the build result may be older than the Cargo.lock file even
if the build was just performed. This may cause ninja to rebuilt
the file every following time. To avoid this, we 'touch' the
build result, so that its last modification time is up to date.
Because the dependency on Cargo.lock was added, the new command
for the build does not modify it. Instead, the developer must
update it when modifying the dependencies - the docs are updated
to reflect that.

Closes #12489

Fixes #12508
2023-01-12 14:44:11 +02:00
Kefu Chai
77baea2add docs/architecture: fix typo of SyllaDB
s/SyllaDB/ScyllaDB/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12505
2023-01-12 12:25:53 +02:00
Michał Chojnowski
1ff4abef4a configure.py: prevent deduplication of seastar compile options
In its infinite wisdom, CMake deduplicates the options passed
to `target_compile_options`, making it impossible to pass options which require
duplication, such as -mllvm.
Passing e.g.
`-mllvm;-pgso=false;-mllvm;-inline-threshold=2500` invokes the compiler
`-mllvm -pgso=false -inline-threshold=2500`, breaking the options.

As a workaround, CMake added the `SHELL:` syntax, which makes it possible to
pass the list of options not as a CMake list, but as a shell-quoted string.
Let's use it, so we can pass multiple -mllvm options.
2023-01-12 11:24:10 +01:00
Michał Chojnowski
85facefe45 configure.py: rename clang_inline_threshold()
There's a global variable (the CLI argument) with the same name.
Rename one of the two to avoid accidental mixups.
2023-01-12 11:24:10 +01:00
Michał Chojnowski
d9de78f6d3 configure.py: rework the seastar_cflags variable
The name of this variable is misleading. What it really does is pass flags to
static libraries compiled by us, not just to seastar.
We will need this capability to implement cross-artifact optimizations in our
build.
We will also need to pass linker flags, and we will need to vary those flags
depending on the build mode.

This patch splits the seastar_cflags variable into per-mode lib_cflags and
lib_ldflags variables. It shouldn't change the resulting build.ninja for now,
but will be needed by later planned patches.
2023-01-12 11:24:10 +01:00
Michał Chojnowski
ee462a9d3c configure.py: hoist the pkg_config() call for seastar-testing.pc
Put the pkg_config() for seastar-testing.pc in the same area as the call
for seastar.pc, outside of the loop.
This is a cosmetic change aimed at making following commits cleaner.
2023-01-12 11:24:10 +01:00
Michał Chojnowski
c9aeeeae11 configure.py: unify the libs variable for tests and non-tests
This is a cosmetic change aimed at make following commits in the same area
cleaner.
2023-01-12 11:24:09 +01:00
Michał Chojnowski
10ac881ef1 configure.py: fix indentation
Fix indentation after the preceeding commit.
2023-01-12 11:23:32 +01:00
Michał Chojnowski
be419adaf8 configure.py: remove a stale code path for .a artifacts
Scylla haven't had `.a` artifacts for a long time (since the Urchin days,
I believe), and the piece of code responsible for them is stale and untested.
Remove it.
2023-01-12 11:22:49 +01:00
Botond Dénes
8a86f8d4ef gdbinit: add ignore clause for SIG35
Another real-time even often raised in scylla, making debugging a live
process annoying.

Closes #12507
2023-01-12 12:13:04 +02:00
Avi Kivity
7a8a442c1e transport: drop some dead code around v1 and v2 protocols
In 424dbf43f ("transport: drop cql protocol versions 1 and 2"),
we dropped support for protocols 1 and 2, but some code remains
that checks for those versions. It is now dead code, so remove it.

Closes #12497
2023-01-12 12:52:19 +02:00
Avi Kivity
4de2524a42 build: update toolchain for scylla-driver package
Pull updated scylla-driver package, fixing an IP change related
bug [1].

[1] https://github.com/scylladb/python-driver/issues/198

Closes #12501
2023-01-11 22:16:35 +02:00
Nadav Har'El
7192283172 Merge 'doc: add the upgrade guide for ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/12315

This PR adds the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2.
Instead of adding separate guides per platform, I've merged the information to create one platform-agnostic guide, similar to what we did for [OSS->OSS](https://docs.scylladb.com/stable/upgrade/upgrade-opensource/upgrade-guide-from-5.0-to-5.1/) and [Enterprise->Enterprise ](https://github.com/scylladb/scylladb/pull/12339)guides.

Closes #12450

* github.com:scylladb/scylladb:
  doc: add the new upgrade guide to the toctree and fix its name
  docs: add the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2
2023-01-11 21:01:34 +02:00
Avi Kivity
cb2cb8a606 utils: small_vector: mark throw_out_of_range() const
It can be called from the const version of small_vector::at.

Closes #12493
2023-01-11 20:58:53 +02:00
Nadav Har'El
04d6402780 docs: cql-extensions.md: explain our NULL handling
Our handling of NULLs in expressions is different from Cassandra's,
and more uniform. For example, the filter "WHERE x = NULL" is an
error in Cassandra, but supported in Scylla. Let's explain how and why.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12494
2023-01-11 20:56:50 +02:00
Wojciech Mitros
95031074a5 configure: fix the order of rust header generation
Currently, no rule enforces that the cxx.h rust header
is generated before compiling the .cc files generated
from rust. This patch adds this dependency.

Closes #12492
2023-01-11 16:55:53 +02:00
Botond Dénes
210738c9ce Merge 'test.py: improve logging' from Kamil Braun
Make it easy to see which clusters are operated on by which tests in which build modes and so on.
Add some additional logs.

These improvements would have saved me a lot of debugging time if I had them last week and we would have https://github.com/scylladb/scylladb/pull/12482 much faster.

Closes #12483

* github.com:scylladb/scylladb:
  test.py: harmonize topology logs with test.py format
  test/pylib: additional logging during cluster setup
  test/pylib: prefix cluster/manager logs with the current test name
  test/pylib: pool: pass *args and **kwargs to the build function from get()
  test.py: include mode in ScyllaClusterManager logs
2023-01-11 16:32:56 +02:00
Aleksandra Martyniuk
fcb3f76e78 test: add test checking recursive task status
Rest api test checking whether task manager api returns recursive tasks'
statuses properly in BFS order.
2023-01-11 12:34:17 +01:00
Aleksandra Martyniuk
6b79c92cb7 api: get task statuses recursively
Sometimes to debug some task manager module, we may want to inspect
the whole tree of descendants of some task.

To make it easier, an api call getting a list of statuses of the requested
task and all its descendants in BFS order is added.
2023-01-11 12:34:06 +01:00
Konstantin Osipov
f3440240ee test.py: harmonize topology logs with test.py format
We need millisecond resolution in the log to be able to
correlate test log with test.py log and scylla logs. Harmonize
the log format for tests which actively manage scylla servers.
2023-01-11 10:09:42 +01:00
Kamil Braun
79712185d5 test/pylib: additional logging during cluster setup
This would have saved me a lot of debugging time.
2023-01-11 10:09:42 +01:00
Kamil Braun
4f7e5ee963 test/pylib: prefix cluster/manager logs with the current test name
The log file produced by test.py combines logs coming from multiple
concurrent test runs. Each test has its own log file as well, but this
"global" log file is useful when debugging problems with topology tests,
since many events related to managing clusters are stored there.

Make the logs easier to read by including information about the test case
that's currently performing operations such as adding new servers to
clusters and so on. This includes the mode, test run name and the name
of the test case.

We do this by using custom `Logger` objects (instead of calling
`logging.info` etc. which uses the root logger) with `LoggerAdapter`s
that include the prefixes. A bit of boilerplate 'plumbing' through
function parameters is required but it's mostly straightforward.

This doesn't apply to all events, e.g. boost test cases which don't
setup a "real" Scylla cluster. These events don't have additional
prefixes.

Example:
```

17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server...
17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10...
17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10...
17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server...
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11...
17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11...
```
2023-01-11 10:09:39 +01:00
Avi Kivity
de0c31b3b6 cql3: query_options: simplify batch query_options constructor
The batch constructor uses an unnecessarily complicated template,
where in fact it only vector<vector<raw_value | raw_value_view>>.

Simplify the constructor to allow exactly that. Delete some confusing
comments around it.

Closes #12488
2023-01-11 07:54:54 +02:00
Kamil Braun
2bda0f9830 test/pylib: pool: pass *args and **kwargs to the build function from get()
This will be used to specify a custom logger when building new clusters
before starting tests, allowing to easily pinpoint which tests are
waiting for clusters to be built and what's happening to these
particular clusters.
2023-01-10 17:41:54 +01:00
Kamil Braun
ff2c030bf9 test.py: include mode in ScyllaClusterManager logs
The logs often mention the test run and the current test case in a given
run, such as `test_topology.1` and
`test_topology.1::test_add_server_add_column`. However, if we run
test.py in multiple modes, the different modes might be running the same
test case and the logs become confusing. To disambiguate, prefix the
test run/case names with the mode name.

Example:
```
Leasing Scylla cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4
760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) for test dev/topology.test_topology.1::test_add_server_add_column
```
2023-01-10 17:41:54 +01:00
Wojciech Mitros
e558c7d988 functions: initialize aggregates on scylla start
Currently, UDAs can't be reused if Scylla has been
restarted since they have been created. This is
caused by the missing initialization of saved
UDAs that should have inserted them to the
cql3::functions::functions::_declared map, that
should store all (user-)created functions and
aggregates.

This patch adds the missing implementation in a way
that's analogous to the method of inserting UDF to
the _declared map.

Fixes #11309
2023-01-10 17:44:18 +02:00
Wojciech Mitros
d1b809754c database: wrap lambda coroutines used as arguments in coroutine::lambda
Using lambda coroutines as arguments can lead to a use-after-free.
Currently, the way these lambdas were used in do_parse_schema_tables
did not lead to such a problem, but it's better to be safe and wrap
them in coroutine::lambda(), so that they can't lead to this problem
as long as we ensure that the lambda finishes in the
do_parse_schema_tables() statement (for example using co_await).

Closes #12487
2023-01-10 17:24:52 +02:00
Nadav Har'El
0edb090c67 test/cql-pytest: add simple tests for SELECT DISTINCT
This patch adds a few simple functional test for the SELECT DISTINCT
feature, and how it interacts with other features especiall GROUP BY.

2 of the 5 new tests are marked xfail, and reproduce one old and one
newly-discovered issue:

Refs #5361: LIMIT doesn't work when using GROUP BY (the test here uses
            LIMIT and GROUP BY together with SELECT DISTINCT, so the
            LIMIT isn't honored).

Refs #12479: SELECT DISTINCT doesn't refuse GROUP BY with clustering
             column.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12480
2023-01-10 13:29:26 +02:00
Michał Radwański
dcab289656 boost/mvcc_test: use failure_injecting_allocation_strategy where it is meant to
In test_apply_is_atomic, a basic form of exception testing is used.
There is failure_injecting_allocation_strategy, which however is not
used for any allocation, since for some reason,
`with_allocator(r.allocator()` is used instead of
`with_allocator(alloc`. Fix that.

Closes #12354
2023-01-10 12:01:36 +01:00
Tomasz Grabiec
ebcd736343 cache: Fix undefined behavior when populating with non-full keys
Regression introduced in 23e4c8315.

view_and_holder position_in_partiton::after_key() triggers undefined
behavior when the key was not full because the holder is moved, which invalidates the view.

Fixes #12367

Closes #12447
2023-01-10 12:51:54 +02:00
Jan Ciolek
8d7e35caef cql3: expr: remove reference to temporary in get_rhs_receiver
The function underlying_type() returns an data_type by value,
but the code assigned it to a reference.

At first I was sure this is an error
(assigning temporary value to a reference), but it turns out
that this is most likely correct due to C++ lifetime
extension rules.

I think it's better to avoid such unituitive tricks.
Assigning to value makes it clearer that the code
is correct and there are no dangling references.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #12485
2023-01-10 09:42:49 +02:00
Raphael "Raph" Carvalho
407c7fdaf2 docs: Fix command to create a symbolic link to relocatable pkg dir
Closes #12481
2023-01-10 07:09:14 +02:00
Kamil Braun
822410c49b test/pylib: scylla_cluster: release IPs when cluster is no longer needed
With sufficiently many test cases we would eventually run out of IP
addresses, because IPs (which are leased from a global host registry)
would only be released at the end of an entire test suite.

In fact we already hit this during next promotions, causing much pain
indeed.

Release IPs when a cluster, after being marked dirty, is stopped and
thrown away.

Closes #12482
2023-01-10 06:59:41 +02:00
Avi Kivity
e71e1dc964 Merge 'tools/scylla-sstable: add lua scripting support' from Botond Dénes
Introduce a new "script" operation, which loads a script from the specified path, then feeds the mutation fragment stream to it. The script can then extract, process and present information from the sstable as it wishes.
For now only Lua scripts are supported for the simple reason that Lua is easy to write bindings for, it is simple and lightweight and more importantly we already have Lua included in the Scylla binary as it is used as the implementation language for UDF/UDA. We might consider WASM support in the future, but for now we don't have any language support in WASM available.

Example:
```lua
function new_stats(key)
    return {
        partition_key = key,
        total = 0,
        partition = 0,
        static_row = 0,
        clustering_row = 0,
        range_tombstone_change = 0,
    };
end

total_stats = new_stats(nil);

function inc_stat(stats, field)
    stats[field] = stats[field] + 1;
    stats.total = stats.total + 1;
    total_stats[field] = total_stats[field] + 1;
    total_stats.total = total_stats.total + 1;
end

function on_new_sstable(sst)
    max_partition_stats = new_stats(nil);
    if sst then
        current_sst_filename = sst.filename;
    else
        current_sst_filename = nil;
    end
end

function consume_partition_start(ps)
    current_partition_stats = new_stats(ps.key);
    inc_stat(current_partition_stats, "partition");
end

function consume_static_row(sr)
    inc_stat(current_partition_stats, "static_row");
end

function consume_clustering_row(cr)
    inc_stat(current_partition_stats, "clustering_row");
end

function consume_range_tombstone_change(crt)
    inc_stat(current_partition_stats, "range_tombstone_change");
end

function consume_partition_end()
    if current_partition_stats.total > max_partition_stats.total then
        max_partition_stats = current_partition_stats;
    end
end

function on_end_of_sstable()
    if current_sst_filename then
        print(string.format("Stats for sstable %s:", current_sst_filename));
    else
        print("Stats for stream:");
    end
    print(string.format("\t%d fragments in %d partitions - %d static rows, %d clustering rows and %d range tombstone changes",
        total_stats.total,
        total_stats.partition,
        total_stats.static_row,
        total_stats.clustering_row,
        total_stats.range_tombstone_change));
    print(string.format("\tPartition with max number of fragments (%d): %s - %d static rows, %d clustering rows and %d range tombstone changes",
        max_partition_stats.total,
        max_partition_stats.partition_key,
        max_partition_stats.static_row,
        max_partition_stats.clustering_row,
        max_partition_stats.range_tombstone_change));
end
```
Running this script wilt yield the following:
```
$ scylla sstable script --script-file fragment-stats.lua --system-schema system_schema.columns /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-1-big-Data.db
Stats for sstable /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//me-1-big-Data.db:
        397 fragments in 7 partitions - 0 static rows, 362 clustering rows and 28 range tombstone changes
        Partition with max number of fragments (180): system - 0 static rows, 179 clustering rows and 0 range tombstone changes
```

Fixes: https://github.com/scylladb/scylladb/issues/9679

Closes #11649

* github.com:scylladb/scylladb:
  tools/scylla-sstable: consume_reader(): improve pause heuristincs
  test/cql-pytest/test_tools.py: add test for scylla-sstable script
  tools: add scylla-sstable-scripts directory
  tools/scylla-sstable: remove custom operation
  tools/scylla-sstable: add script operation
  tools/sstable: introduce the Lua sstable consumer
  dht/i_partitioner.hh: ring_position_ext: add weight() accessor
  lang/lua: export Scylla <-> lua type conversion methods
  lang/lua: use correct lib name for string lib
  lang/lua: fix type in aligned_used_data (meant to be user_data)
  lang/lua: use lua_State* in Scylla type <-> Lua type conversions
  tools/sstable_consumer: more consistent method naming
  tools/scylla-sstable: extract sstable_consumer interface into own header
  tools/json_writer: add accessor to underlying writer
  tools/scylla-sstable: fix indentation
  tools/scylla-sstable: export mutation_fragment_json_writer declaration
  tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer
  tools/scylla-sstable: extract json writing logic from json_dumper
  tools/scylla-sstable: extract json_writer into its own header
  tools/scylla-sstable: use json_writer::DataKey() to write all keys
  tools/scylla-types: fix use-after-free on main lambda captures
2023-01-09 20:54:42 +02:00
Raphael S. Carvalho
05ffb024bb replica: Kill table::calculate_shard_from_sstable_generation()
Inferring shard from generation is long gone. We still use it in
some scripts, but that's no longer needed in Scylla, when loading
the SSTables, and it also conflicts with ongoing work of UUID-based
generations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12476
2023-01-09 20:17:57 +02:00
Takuya ASADA
548c9e36a1 main: add tcp_timestamps sanity check
Check net.ipv4.tcp_timestamps, show warning message when it's not set to 1.

Fixes #12144

Closes #12199
2023-01-09 19:08:21 +02:00
Nadav Har'El
d6e6820f33 Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity
The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.

Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.

Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.

Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.

Fixes #10607

Closes #12432

* github.com:scylladb/scylladb:
  treewide: drop cql_serialization_format
  cql: modification_statement: drop protocol check for LWT
  transport: drop cql protocol versions 1 and 2
2023-01-09 18:52:41 +02:00
Botond Dénes
bd42da6e69 tools/scylla-sstable: consume_reader(): improve pause heuristincs
The consume loop had some heuristics in place to determine whether after
pausing, the consumer wishes to skip just the partition or the remaining
content of the sstable. This heuristics was flawed so replace it with a
non-heuristic method: track the last consumed fragment and look at this
to determine what should be done.
2023-01-09 09:46:57 -05:00
Botond Dénes
1d222220e0 test/cql-pytest/test_tools.py: add test for scylla-sstable script
To test the script operation, we use some of the example scripts from
the example directory. Namely, dump.lua and slice.lua. These two scripts
together have a very good coverage of the entire script API. Testing
their functionality therefore also provides a good coverage of the lua
bindings. A further advantage is that since both scripts dump output in
identical format to that of the data-dump operation, it is trivial to do
a comparison against this already tested operation.
A targeted test is written for the sstable skip functionality of the
consumer API.
2023-01-09 09:46:57 -05:00
Botond Dénes
ace42202df tools: add scylla-sstable-scripts directory
To be the home of example scripts for scylla-sstable. For now only a
README.md is added describing the directory's purpose and with links to
useful resources.
One example script is added in this patch, more will come later.
2023-01-09 09:46:57 -05:00
Botond Dénes
7b40463f29 tools/scylla-sstable: remove custom operation
We now have a script operation, the custom operation (poor man's script
operation) has no reason to exist anymore.
2023-01-09 09:46:57 -05:00
Botond Dénes
e5071fdeab tools/scylla-sstable: add script operation
Loads the script from the specified path, then feeds the mutation
fragment stream to it. For now only Lua scripts are supported for the
simple reason that Lua is easy to write bindings for, it is simple and
lightweight and more importantly we already have Lua included in the
Scylla binary as it is used as the implementation language for UDF/UDA.
We might consider WASM support in the future, but for now we don't have
any language support in WASM available.
2023-01-09 09:46:57 -05:00
Botond Dénes
9dd5107919 tools/sstable: introduce the Lua sstable consumer
The Lua sstable consumer loads a script from the specified path then
feeds the mutation fragment stream to the script via the
sstable_consumer methods, each method of which the script is allowed to
define, effectively overloading the virtual method in Lua.
This allows for very wide and flexible customization opportunities for
what to extract from sstables and how to process and present them,
without the need to recompile the scylla-sstable tool.
2023-01-09 09:46:57 -05:00
Botond Dénes
50b155e706 dht/i_partitioner.hh: ring_position_ext: add weight() accessor 2023-01-09 09:46:57 -05:00
Botond Dénes
8699fe5001 lang/lua: export Scylla <-> lua type conversion methods
Currently hidden in lang/lua.cc, declare these in a header so others can
use it.
2023-01-09 09:46:57 -05:00
Botond Dénes
e9a52837cf lang/lua: use correct lib name for string lib
AFAIK the mistake had no real consequence, but still it is nicer to have
it correct.
2023-01-09 09:46:57 -05:00
Botond Dénes
76663d7774 lang/lua: fix type in aligned_used_data (meant to be user_data) 2023-01-09 09:46:57 -05:00
Botond Dénes
943fc3b6f3 lang/lua: use lua_State* in Scylla type <-> Lua type conversions
Instead of the lua_slice_state which is local to this file. We want to
reuse the Scylla type <-> Lua type conversion functions but for that
they have to use the more generic lua_State*. No functionality or
convenience is lost with the switch, the code didn't make use of the
other fields bundled in lua_slice_state.
2023-01-09 09:46:57 -05:00
Botond Dénes
8045751867 tools/sstable_consumer: more consistent method naming
Use `consume_` consistently across the entire interface, instead of having
some methods with `on_` and others with `consume_` prefixes.
2023-01-09 09:46:57 -05:00
Botond Dénes
8e117501ac tools/scylla-sstable: extract sstable_consumer interface into own header
So it can be used in code outside scylla-sstable.cc. This source file is
quite large already, and as we have yet another large chunk of code to
add, we want to add it in a separate file.
2023-01-09 09:46:57 -05:00
Botond Dénes
9b1c486051 tools/json_writer: add accessor to underlying writer 2023-01-09 09:46:57 -05:00
Botond Dénes
cfb5afbe9b tools/scylla-sstable: fix indentation
Left broken by previous patches.
2023-01-09 09:46:57 -05:00
Botond Dénes
d42b0bb5d5 tools/scylla-sstable: export mutation_fragment_json_writer declaration
To json_writer.hh. Method definition are left in scylla-sstable.cc.
Indentation is left broken, will be fixed by the next patch.
2023-01-09 09:46:57 -05:00
Botond Dénes
517135e155 tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer
There is no point in the former implementing said interface. For one it
is a futurized interface, which is not needed for something writing to
the stdout. Rename the methods to follow the naming convention of rjson
writers more closely.
2023-01-09 09:46:57 -05:00
Botond Dénes
0ee1c6ca57 tools/scylla-sstable: extract json writing logic from json_dumper
We want to split this class into two parts: one with the actual logic
converting mutation fragments to json, and a wrapper over this one,
which implements the sstable_consumer interface.
As a first step we extract the class as is (no changes) and just forward
all-calls from now empty wrapper to it.
2023-01-09 09:46:57 -05:00
Botond Dénes
55ef0ed421 tools/scylla-sstable: extract json_writer into its own header
Other source files will want to use it soon.
2023-01-09 09:46:57 -05:00
Botond Dénes
8623818a8d tools/scylla-sstable: use json_writer::DataKey() to write all keys
This method was renamed from its previous name of PartitionKey. Since in
json partition keys and clustering keys look alike, with the only
difference being that the former may also have a token, it makes to have
a single method to write them (with an optional token parameter). This
was the case at some point, json_dumper::write_key() taking this role.
However at a later point, json_writer::PartitionKey() was introduced and
now the code uses both. Standardize on the latter and give it a more
generic name.
2023-01-09 09:46:57 -05:00
Botond Dénes
602fca0a12 tools/scylla-types: fix use-after-free on main lambda captures
The main lambda of scylla-types, the one passed to app_template::run()
was recently made a coroytine. app_template::run() however doesn't keep
this lambda alive and hence after the first suspention point, accessing
the lambda's captures triggers use-after-free.
The simple fix is to convert the coroutine into continuation chain.
2023-01-09 09:46:57 -05:00
Tomasz Grabiec
f97268d8f2 row_cache: Fix violation of the "oldest version are evicted first" when evicting last dummy
Consider the following MVCC state of a partition:

   v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

Where === means a continuous range and --- means a discontinuous range.

After two LRU items are evicted (entry1 and entry2), we will end up with:

   v2: ---------------------- <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.

This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:

   v2: ---------------------- <9> ===== <last dummy>

...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.

The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.

The situation is not easy to trigger because it requires certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.

Closes #12452
2023-01-09 16:10:52 +02:00
Avi Kivity
1bb1855757 Merge 'replica/database: fix read related metrics' from Botond Dénes
Sstable read related metrics are broken for a long time now. First, the introduction of inactive reads (https://github.com/scylladb/scylladb/issues/1865) diluted this metric, as it now also contained inactive reads (contrary to the metric's name). Then, after moving the semaphore in front of the cache (3d816b7c1) this metric became completely broken as this metric now contains all kinds of reads: disk, in-memory and inactive ones too.
This series aims to remedy this:
* `scylla_database_active_reads` is fixed to only include active reads.
* `scylla_database_active_reads_memory_consumption` is renamed to `scylla_database_reads_memory_consumption` and its description is brought up-to-date.
* `scylla_database_disk_reads` is added to track current reads that are gone to disk.
* `scylla_database_sstables_read` is added to track the number of sstables read currently.

Fixes: https://github.com/scylladb/scylladb/issues/10065

Closes #12437

* github.com:scylladb/scylladb:
  replica/database: add disk_reads and sstables_read metrics
  sstables: wire in the reader_permit's sstable read count tracking
  reader_concurrency_semaphore: add disk_reads and sstables_read stats
  replica/database: fix active_reads_memory_consumption_metric
  replica/database: fix active_reads metric
2023-01-09 12:18:49 +02:00
Pavel Emelyanov
e20738cd7d azure_snitch: Handle empty zone returned from IMDS
Azure metadata API may return empty zone sometimes. If that happens
shard-0 gets empty string as its rack, but propagates UNKNOWN_RACK to
other shards.

Empty zones response should be handled regardless.

refs: #12185

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12274
2023-01-09 11:57:45 +02:00
Nadav Har'El
2d845b6244 test/cql-pytest: a test for more than one equality in WHERE
Cassandra refuses a request with more than one equality relation to the
same column, for example

    DELETE FROM tbl WHERE partitionKey = ? AND partitionKey = ?

It complains that

    partitionkey cannot be restricted by more than one relation if it
    includes an Equal

Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept this query, we should be
sure we do the right thing: "WHERE p = 1 AND p = 2" should match
nothing (not the first, or last, value being tested..), and "WHERE p = 1
AND p = 1" should match the matches of p = 1. This patch adds a test
for verify that these requests indeed yield correct results. The
test is scylla_only because, as explained above, Cassandra doesn't
support this feature at all.

Refs #12472

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12473
2023-01-09 11:56:39 +02:00
Anna Stuchlik
b61515c871 doc: replace Scylla with ScyllaDB on the menu tree and major links; related: https://github.com/scylladb/scylla-docs/issues/3962
Closes #12456
2023-01-09 08:39:50 +02:00
Avi Kivity
42575340ba Update seastar submodule
* seastar ca586cfb8d...8889cbc198 (14):
  > http: request_parser: fix grammar ambiguity in field_content
Fixes #12468
  > sstring: use fold expression to simply copy_str_to()
  > sstring: use fold expression to simply str_len()
  > metrics: capture by move in make_function()
  > metrics: replace homebrew is_callable<> with is_invocable_v<>
  > reactor: use std::move() to avoid copy.
  > reactor: remove redundant semicolon.
  > reactor: use mutable to make std::move() work.
  > build: install liburing explicitly on ArchLinux.
  > reactor: use a for loop for submitting ios
  > metrics: add spaces around '='
  > parallel utils: align concept with implementation
  > reactor: s/resize(0)/clear()/
  > reactor: fix a typo in comment

Closes #12469
2023-01-08 18:56:00 +02:00
Alejo Sanchez
d632e1aa7a test/pytest: add missing import, remove unused import
Add missed import time and remove unused name import.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12446
2023-01-08 17:38:46 +02:00
Avi Kivity
5ffe4fee6d Merge 'Remove legacy half reverse' from Michał Radwański
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).

As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.

In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.

As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:

The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains

```
     if (reverse == consume_in_reverse::yes) {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
        }
     } else {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
         }
     }
```

So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.

Refs: #12353

Closes #12453

* github.com:scylladb/scylladb:
  mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
  mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes
  mutation: move consume_in_reverse def to mutation_consumer.hh
2023-01-08 15:42:00 +02:00
Botond Dénes
c4688563e3 sstables: track decompressed buffers
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore and it has an accurate view of the
actual memory consumption of reads.

Fixes: #12448

Closes #12454
2023-01-08 15:34:28 +02:00
Kamil Braun
b77df84543 test: test_topology: make test_nodes_with_different_smp less hacky
The test would use a trick to start a separate Scylla cluster from the
one provided originally by the test framework. This is not supported by
the test framework and may cause unexpected problems.

Change the test to perform regular node operations. Instead of starting
a fresh cluster of 3 nodes, we join the first of these nodes to the
original framework-provided cluster, then decommission the original
nodes, then bootstrap the other 2 fresh nodes.

Also add some logging to the test.

Refs: #12438, #12442

Closes #12457
2023-01-08 15:33:17 +02:00
Avi Kivity
02c9968e73 Merge 'Add WASM UDF implementation in Rust' from Wojciech Mitros
This series adds the implementation and usage of rust wasmtime bindings.

The WASM UDFs introduced by this patch are interruptable and use memory allocated using the seastar allocator.

This series includes #11102 (the first two commits) because #11102 required disabling wasm UDFs completely. This patch disables them in the middle of the series, and enables them again at the end.
After this patch, `libwasmtime.a` can be removed from the toolchain.
This patch also removes the workaround for #https://github.com/scylladb/scylladb/issues/9387 but it hasn't been tested with ARM yet - if the ARM test causes issues I'll revert this part of the change.

Closes #11351

* github.com:scylladb/scylladb:
  build: remove references to unused c bindings of wasmtime
  test: assert that WASM allocations can fail without crashing
  wasm: limit memory allocated using mmap
  wasm: add configuration options for instance cache and udf execution
  test: check that wasmtime functions yield
  wasm: use the new rust bindings of wasmtime
  rust: add Wasmtime bindings
  rust: add build profiles more aligned with ninja modes
  rust: adjust build according to cxxbridge's recommendations
  tools: toolchain: dbuild: prepare for sharing cargo cache
2023-01-08 15:31:09 +02:00
Nadav Har'El
f5cda3cfc3 test/cql-pytest: add more tests for "timestamp" column type
In issue #3668, a discussion spanning several years theorized that several
things are wrong with the "timestamp" type. This patch begins by adding
several tests that demonstrate that Scylla is in fact behaving correctly,
and mostly identically to Cassandra except one esoteric error handling
case.

However, after eliminating the red herrings, we are left for the real
issue that prompted opening #3668, which is a duplicate of issues #2693
and #2694, and this patch also adds a reproducer for that. The issue is
that Cassandra 4 added support for arithmetic expressions on values,
and timestamps can be added durations, for example:

        '2011-02-03 04:05:12.345+0000' - 1d

is a valid timestamp - and we don't currently support this syntax.
So the new test - which passes on Cassandra 4 and fails on Scylla
(or Cassandra 3) is marked xfail.

Refs #2693
Refs #2694

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12436
2023-01-08 15:00:49 +02:00
Michał Chojnowski
08b3a9c786 configure: don't reduce parsers' optimization level to 1 in release
The line modified in this patch was supposed to increase the
optimization levels of parsers in debug mode to 1, because they
were too slow otherwise. But as a side effect, it also reduced the
optimization level in release mode to 1. This is not a problem
for the CQL frontend, because statement preparation is not
performance-sensitive, but it is a serious performance problem
for Alternator, where it lies in the hot path.

Fix this by only applying the -O1 to debug modes.

Fixes #12463

Closes #12460
2023-01-06 18:04:36 +02:00
Wojciech Mitros
903c4874d0 build: remove references to unused c bindings of wasmtime
Before the changes intorducing the new wasmtime bindings we relied
on an downloaded static library libwasmtime.a. Now that the bindings
are introduced, we do not rely on it anymore, so all references to
it can be removed.
2023-01-06 14:07:29 +01:00
Wojciech Mitros
996a942e05 test: assert that WASM allocations can fail without crashing
The main source of big allocations in the WASM UDF implementation
is the WASM Linear Memory. We do not want Scylla to crash even if
a memory allocation for the WASM Memory fails, so we assert that
an exception is thrown instead.

The wasmtime runtime does not actually fail on an allocation failure
(assuming the memory allocator does not abort and returns nullptr
instead - which our seastar allocator does). What happens then
depends on the failed allocation handling of the code that was
compiled to WASM. If the original code threw an exception or aborted,
the resulting WASM code will trap. To make sure that we can handle
the trap, we need to allow wasmtime to handle SIGILL signals, because
that what is used to carry information about WASM traps.

The new test uses a special WASM Memory allocator that fails after
n allocations, and the allocations include both memory growth
instructions in WASM, as well as growing memory manually using the
wasmtime API.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2023-01-06 14:07:29 +01:00
Wojciech Mitros
f05d612da8 wasm: limit memory allocated using mmap
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory do the WASM programs use and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually acheved by eviction of all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
2023-01-06 14:07:29 +01:00
Wojciech Mitros
b8d28a95bf wasm: add configuration options for instance cache and udf execution
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their cache of wasm,
the maximum size of indivitual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (for the control of
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop) and the hard limit of the size of UDFs
that are executed (to avoid large allocations)
2023-01-06 14:07:27 +01:00
Wojciech Mitros
3214f5c2db test: check that wasmtime functions yield
The new implementation for WASM UDFs allows executing the UDFs
in pieces. This commit adds a test asserting that the UDF is in fact
divided and that each of the execution segments takes no longer than
1ms.
2023-01-06 14:05:53 +01:00
Wojciech Mitros
3146807192 wasm: use the new rust bindings of wasmtime
This patch replaces all dependencies on the wasmtime
C++ bindings with our new ones.
The wasmtime.hh and wasm_engine.hh files are deleted.
The libwasmtime.a library is no longer required by
configure.py. The SCYLLA_ENABLE_WASMTIME macro is
removed and wasm udfs are now compiled by default
on all architectures.
In terms of implementation, most of code using
wasmtime was moved to the Rust source files. The
remaining code uses names from the new bindings
(which are mostly unchanged). Most of wasmtime objects
are now stored as a rust::Box<>, to make it compatible
with rust lifetime requirements.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2023-01-06 14:05:53 +01:00
Wojciech Mitros
50b24cf036 rust: add Wasmtime bindings
The C++ bindings provided by wasmtime are lacking a crucial
capability: asynchronous execution of the wasm functions.
This forces us to stop the execution of the function after
a short time to prevent increasing the latency. Fortunately,
this feature is implemented in the native language
of Wasmtime - Rust. Support for Rust was recently added to
scylla, so we can implement the async bindings ourselves,
which is done in this patch.

The bindings expose all the objects necessary for creating
and calling wasm functions. The majority of code implemented
in Rust is a translation of code that was previously present
in C++.

Types exported from Rust are currently required to be defined
by the  same crate that contains the bridge using them, so
wasmtime types can't be exported directly. Instead, for each
class that was supposed to be exported, a wrapper type is
created, where its first member is the wasmtime class. Note
that the members are not visible from C++ anyway, the
difference only applies to Rust code.

Aside from wasmtime types and methods, two additional types
are exported with some associated methods.
- The first one is ValVec, which is a wrapper for a rust Vec
of wasmtime Vals. The underlying vector is required by
wasmtime methods for calling wasm functions. By having it
exported we avoid multiple conversions from a Val wrapper
to a wasmtime Val, as would be required if we exported a
rust Vec of Val wrappers (the rust Vec itself does not
require wrappers if the type it contains is already wrapped)
- The second one is Fut. This class represents an computation
tha may or may not be ready. We're currently using it
to control the execution of wasm functions from C++. This
class exposes one method: resume(), which returns a bool
that signals whether the computation is finished or not.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2023-01-06 14:05:53 +01:00
Wojciech Mitros
33c97de25c rust: add build profiles more aligned with ninja modes
A cargo profile is created for each of build modes: dev, debug,
sanitize, realease and coverage. The names of cargo profiles are
prefixed by "rust-" because cargo does not allow separate "dev"
and "debug" profiles.

The main difference between profiles are their optimization levels,
they correlate to the levels used in configure.py. The debug info
is stripped only in the dev mode, and only this mode uses
"incremental" compilation to speed it up.
2023-01-06 14:05:53 +01:00
Wojciech Mitros
4d7858e66d rust: adjust build according to cxxbridge's recommendations
Currently, the rust build system in Scylla creates a separate
static library for each incuded rust package. This could cause
duplicate symbol issues when linking against multiple libraries
compiled from rust.

This issue is fixed in this patch by creating a single static library
to link against, which combines all rust packages implemented in
Scylla.

The Cargo.lock for the combined build is now tracked, so that all
users of the same scylla version also use the same versions of
imported rust modules.

Additionally, the rust package implementation and usage
docs are modified to be compatible with the build changes.

This patch also adds a new header file 'rust/cxx.hh' that contains
definitions of additional rust types available in c++.
2023-01-06 14:05:53 +01:00
Avi Kivity
eeaa475de9 tools: toolchain: dbuild: prepare for sharing cargo cache
Rust's cargo caches downloaded sources in ~/.cargo. However dbuild
won't provide access to this directory since it's outside the source
directory.

Prepare for sharing the cargo cache between the host and the dbuild
environment by:
 - Creating the cache if it doesn't already exist. This is likely if
   the user only builds in a dbuild environment.
 - Propagating the cache directory as a mounted volume.
 - Respecting the CARGO_HOME override.
2023-01-06 14:05:53 +01:00
Avi Kivity
6868dcf30b tools: toolchain: drop s390x from prepare script architecture list
It's been a long while since we built ScyllaDB for s390x, and in
fact the last time I checked it was broken on the ragel parser
generator generating bad source files for the HTTP parser. So just
drop it from the list.

I kept s390x in the architecture mapping table since it's still valid.

Closes #12455
2023-01-06 09:08:01 +02:00
Michał Radwański
1fbf433966 mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstone would still be stored according to their start bound).

As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.

In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.

As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:

The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains

```
     if (reverse == consume_in_reverse::yes) {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
        }
     } else {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
         }
     }
```

So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts-away the range tombstone
changes, and thus the only difference between the consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing wrt. ckeys
and the second one was ordered decreasing. This property is maintained if
we swap out for the consume_in_reverse::yes format.
2023-01-05 18:48:55 +01:00
Botond Dénes
2612f98a6c Merge 'Abort repair tasks' from Aleksandra Martyniuk
Aborting of repair operation is fully managed by task manager.
Repair tasks are aborted:
- on shutdown; top level repair tasks subscribe to global abort source. On shutdown all tasks are aborted recursively
- through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to node_ops_info abort source
- with task manager api (top level tasks are abortable)
- with storage_service api and on failure; these cases were modified to be aborted the same way as the ones from above are.

Closes #12085

* github.com:scylladb/scylladb:
  repair: make top level repair tasks abortable
  repair: unify a way of aborting repair operations
  repair: delete sharded abort source from node_ops_info
  repair: delete unused node_ops_info from data_sync_repair_task_impl
  repair: delete redundant abort subscription from shard_repair_task_impl
  repair: add abort subscription to data sync task
  tasks: abort tasks on system shutdown
2023-01-05 15:21:35 +01:00
Avi Kivity
cc6010b512 Merge 'Make restore_replica_count abortable' from Benny Halevy
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.

Fixes #12429

Closes #12430

* github.com:scylladb/scylladb:
  storage_service: restore_replica_count: demote status_checker related logging to debug level
  storage_service: restore_replica_count: allow aborting removenode_with_repair
  storage_service: coroutinize restore_replica_count
  storage_service: restore_replica_count: undefer stop_status_checker
  storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification
  storage_service: restore_replica_count: coroutinize status_checker
2023-01-05 15:21:35 +01:00
Kamil Braun
09da661eeb Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.

* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
  raft: replace experimental raft option with dedicated flag
  main: move supervisor notification about group registry start where it actually starts
2023-01-05 15:21:35 +01:00
Anna Stuchlik
44e6f18d1b doc: add the new upgrade guide to the toctree and fix its name 2023-01-05 14:13:33 +01:00
Anna Stuchlik
0ad2e3e63a docs: add the upgrade guide from ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2 2023-01-05 13:30:10 +01:00
Aleksandra Martyniuk
dcb91457da api: change retrieve_status signature
Sometimes we may need task status to be nothrow move constructible.
httpd::task_manager_json::task_status does not satisfy this requirement.

retrieve_status returns future<full_task_status> instead of future<task_status>
to provide an intermediate struct with better properties. An argument
is passed by reference to prevent the necessity to copy foreign_ptr.
2023-01-05 13:28:51 +01:00
Kamil Braun
df72536fc5 Merge 'docs: add the upgrade guide for Enterprise from 2022.1 to 2022.2' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/12314

This PR adds the upgrade guide for ScyllaDB Enterprise - from version
2022.1 to 2022.2.  Using this opportunity, I've replaced "Scylla" with
"ScyllaDB" in the upgrade-enterprise index file.

In previous releases, we added several upgrade guides - one per platform
(and version). In this PR, I've merged the information for different
platforms to create one generic upgrade guide. It is similar to what
@kbr- added for the Open Source upgrade guide from 5.0 to 5.1. See
https://docs.scylladb.com/stable/upgrade/upgrade-opensource/upgrade-guide-from-5.0-to-5.1/.

Closes #12339

* github.com:scylladb/scylladb:
  docs: add the info about minor release
  docs: add the new upgade guide 2022.1 to 2022.2 to the index and the toctree
  docs: add the index file for the new upgrage guide from 2022.1 to 2022.2
  docs: add the metrics update file to the upgrade guide 2022.1 to 2022.2
  docs: add the upgrade guide for ScyllaDB Enterprise from 2022.1 to 2022.2
2023-01-04 18:07:00 +01:00
Benny Halevy
086546f575 storage_service: restore_replica_count: demote status_checker related logging to debug level
the status_checker is not the main line of business
of restore_replica_count, starting and stopping it
do nt seem to deserve info level logging, which
might have been useful in the past to debug issues
surrounding that.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
3879ee1db8 storage_service: restore_replica_count: allow aborting removenode_with_repair
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.

Fixes #12429

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
afece5bdc4 storage_service: coroutinize restore_replica_count
and unwrap the async thread started for streaming.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
d1eadc39c1 storage_service: restore_replica_count: undefer stop_status_checker
Now that all exceptions in the rest of the function
are swallowed, just execute the stop_status_checker
deferred action serially before returning, on the
wau to coroutinizing restore_replica_count (since
we can't co_await status_checker inside the deferred
action).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
788ecb738d storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification
On the way to coroutinizing restore_replica_count,
extract awaiting stream_async and send_replication_notification
into a try/catch blocks so we can later undefer stop_status_checker.

The exception is still returned as an exceptional future
which is logged by the caller as warning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:02:42 +02:00
Benny Halevy
b54d121dfd storage_service: restore_replica_count: coroutinize status_checker
There is no need to start a thread for the status_checker
and can be implemented using a background coroutine.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:02:20 +02:00
Botond Dénes
1d273a98b9 readers/multishard: shard_reader::close() silence read-ahead timeouts
Timouts are benign, especially on a read-ahead that turned out to be not
needed at all. They just introduce noise in the logs, so silence them.

Fixes: #12435

Closes #12441
2023-01-04 16:10:09 +02:00
Kamil Braun
4268b1bbc2 Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method.  It allows to override the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.

Fixes: #12252

Closes #12374

* github.com:scylladb/scylladb:
  raft: raft_group0, register RPC verbs on all shards
  raft: raft_append_entries, copy entries to the target shard
  test.py, allow to specify the node's command line in test
2023-01-04 11:11:21 +01:00
Marcin Maliszkiewicz
61a9816bad utils/rjson: enable inlining in rapidjson library
Due to lack of NDEBUG macro inlining was disabled. It's
important for parsing and printing performance.

Testing with perf_simple_query shows that it reduced around
7000 insns/op, thus increasing median tps by 4.2% for the alternator frontend.

Because inlined functions are called for every character
in json this scales with request/response size. When
default write size is increased by around 7x (from ~180 to ~ 1255
bytes) then the median tps increased by 12%.

Running:
./build/release/test/perf/perf_simple_query_g --smp 1 \
                                --alternator forbid --default-log-level error \
                                --random-seed=1235000092 --duration=60 --write

Results before the patch:

median 46011.50 tps (197.1 allocs/op,  12.1 tasks/op,  170989 insns/op,        0 errors)
median absolute deviation: 296.05
maximum: 46548.07
minimum: 42955.49

Results after the patch:

median 47974.79 tps (197.1 allocs/op,  12.1 tasks/op,  163723 insns/op,        0 errors)
median absolute deviation: 303.06
maximum: 48517.53
minimum: 44083.74

The change affects both json parsing and printing.

Closes #12440
2023-01-04 10:27:35 +02:00
Michał Jadwiszczak
83bb77b8bb test/boost/cql_query_test: enable parallelized_aggregation
Run tests for parallelized aggregation with
`enable_parallelized_aggregation` set always to true, so the tests work
even if the default value of the option is false.

Closes #12409
2023-01-04 10:11:25 +02:00
Anna Stuchlik
c4d779e447 doc: Fix https://github.com/scylladb/scylla-doc-issues/issues/854 - update the procedure to update topology strategy when nodes are on different racks
Closes #12439
2023-01-04 09:50:10 +02:00
Avi Kivity
2739ac66ed treewide: drop cql_serialization_format
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).

A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.

Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.

One test checking the 16-bit serialization format is removed.
2023-01-03 19:54:13 +02:00
Avi Kivity
654b96660a cql: modification_statement: drop protocol check for LWT
CQL protocol 1 did not support LWT, but since we don't support it
any more, we can drop the check and the supporting get_protocol_version()
helper.
2023-01-03 19:51:57 +02:00
Avi Kivity
424dbf43f3 transport: drop cql protocol versions 1 and 2
Version 3 was introduced in 2014 (Cassandra 2.1) and was supported
in the very first version of Scylla (2a7da21481 "CQL binary protocol").

Cassandra 3.0 (2015) dropped protocols 1 and 2 as well.
It's safe enough to drop it now, 9 years after introduction of v3
and 7 years after Cassandra stopped supporting it.

Dropping it allows dropping cql_serialization_format, which causes
quite a lot of pain, and is probably broken. This will be dropped in the
following patch.
2023-01-03 19:47:49 +02:00
Avi Kivity
f600ad5c1b Update seastar submodule
* seastar 3db15b5681...ca586cfb8d (28):
  > reactor: trim returned buffer to received number of bytes
  > util/process: include used header
  > build: drop unused target_include_directories()
  > build: use BUILD_IN_SOURCE instead chdir <SOURCE_DIR>
  > build: specify CMake policy CMP0135 to new
  > tests: only destroy allocated pending connections
  > build: silence the output when generating private keys
  > tests, httpd: Limit loopback connection factory sharding
  > lw_shared_ptr: Add nullptr_t comparing operators
  > noncopyable_function: Add concept for (Func func) constructor
  > reactor: add process::terminate() and process::kill()
  > Merge 'tests, include: include headers without ".." in path' from Kefu Chai
  > build: customize toolset for building Boost
  > build: use different toolset base on specified compiler
  > allocator: add an option to reserve additional memory for the OS
  > Merge 'build: pass cflags and ldflags to cooking.sh' from Kefu Chai
  > build: build static library of cryptopp
  > gate: add gate holders debugging
  > build: detect debug build of yaml-cpp also
  > build: do not use pkg_search_module(IMPORTED_TARGET) for finding yaml-cpp
  > build: bump yaml-cpp to 0.7.0 in cooking_recipe
  > build: bump cryptopp to 8.7.0 in cooking_recipe
  > build: bump boost to 1.81.0 in cooking_recipe
  > build: bump fmtlib to 9.1.0 in cooking_recipe
  > shared_ptr: add overloads for fmt::ptr()
  > chunked_fifo: const_iterator: use the base class ctor
  > build: s/URING_LIBARIES/URING_LIBRARIES/
  > build: export the full path of uring with URING_LIBRARIES

Closes #12434
2023-01-03 17:58:31 +02:00
Alejo Sanchez
889acf710c test/python: increase CQL connection timeout for...
test_ssl

In very slow debug builds the default driver timeouts are too low and
tests might fail. Bump up the values to a more reasonable time.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12408
2023-01-03 17:10:46 +02:00
Nadav Har'El
1c96d2134f docs,alternator: link to issue about missing ACL feature
The alternator compatibility.md document mentions the missing ACL
(access control) feature, but unlike other missing features we
forgot to link to the open issue about this missing feature.
So let's add that link.

Refs #5047.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12399
2023-01-03 16:50:33 +02:00
Kamil Braun
fc57626afa Merge 'docs: remove auto_bootstrap option from the documentation' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/12318

This PR removes all occurrences of the `auto_bootstrap` option in the docs.
In most cases, I've simply removed the option name and its definition, but sometimes additional changes were necessary:
- In node-joined-without-any-data.rst, I removed the `auto_bootstrap `option as one of the causes of the problem.
- In rebuild-node.rst, I removed the first step in the procedure (enabling the `auto_bootstrap `option).
- In admin. rst, I removed the section about manual bootstrapping - it's based on setting `auto_bootstrap` to false, which is not possible now.

Closes #12419

* github.com:scylladb/scylladb:
  docs: remove the auto_bootstrap option from the admin procedures - involves removing the Manual Bootstraping section
  docs: remove the auto_bootstrap option from the procedure to replace a dead node
  docs: remove the auto_bootstrap option from the Troubleshooting article about a node joining with no data
  docs: remove the auto_bootstrap option from the procedure to rebuild a node after losing the data volume
  docs: remove the auto_bootstrap option from the procedures to create a cluster or add a DC
2023-01-03 15:44:00 +01:00
Botond Dénes
e4d5b2a373 replica/database: add disk_reads and sstables_read metrics
Tracking the current number of reads gone to disk and the current number
of sstables read by all such reads respectively.
2023-01-03 09:37:29 -05:00
Botond Dénes
2acfa950d7 sstables: wire in the reader_permit's sstable read count tracking
Hook in the relevant methods when creating and destroying sstable
readers.
2023-01-03 09:37:29 -05:00
Botond Dénes
2c0de50969 reader_concurrency_semaphore: add disk_reads and sstables_read stats
And the infrastructure to reader_permit to update them. The
infrastructure is not wired in yet.
These metrics will be used to count the number of reads gone to disk and
the number of sstables read currently respectively.
2023-01-03 09:37:29 -05:00
Botond Dénes
dcd2deb5af replica/database: fix active_reads_memory_consumption_metric
Rename to reads_memory_consumption and drop the "active" from the
description as well. This metric tracks the memory consumption of all
reads: active or inactive. We don't even currently have a way to track
the memory consumption of only active reads.
Drop the part of the description which explains the interaction with
other metrics: this part is outdated and the new interactions are much
more complicated, no way to explain in a metric description.
Also ask the semaphore to calculate the memory amount, instead of doing
it in the metric itself.
2023-01-03 09:25:47 -05:00
Petr Gusev
8417840647 raft: raft_group0, register RPC verbs on all shards
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added
to identify the problem.

Fixes: #12252
2023-01-03 17:04:07 +03:00
Anna Stuchlik
00ef20c3df docs: remove the auto_bootstrap option from the admin procedures - involves removing the Manual Bootstraping section 2023-01-03 14:48:01 +01:00
Anna Stuchlik
b7d62b2fc7 docs: remove the auto_bootstrap option from the procedure to replace a dead node 2023-01-03 14:47:55 +01:00
Anna Stuchlik
bc62e61df1 docs: remove the auto_bootstrap option from the Troubleshooting article about a node joining with no data 2023-01-03 14:46:38 +01:00
Anna Stuchlik
1602f27cd7 docs: remove the auto_bootstrap option from the procedure to rebuild a node after losing the data volume 2023-01-03 14:45:08 +01:00
Botond Dénes
929481ea9c replica/database: fix active_reads metric
This metric has been broken for a long time, since inactive reads were
introduced. As calculated currently, it includes all permits that passed
admission, including inactive reads. On the other hand, it excludes
permits created bypassing admission.
Fix by using the newly introduced (in this patch)
reader_concurrency_semaphore::active_reads() as the basis of this
metric: this now includes all permits (reads) that are currently active,
excluding waiters and inactive reads.
2023-01-03 08:12:25 -05:00
Petr Gusev
7725e03a09 raft: raft_append_entries, copy entries to the target shard
If append_entries RPC was received on a non-zero shard, we may
need to pass it to a zero (or, potentially, some other) shard.
The problem is that raft::append_request contains entries in the form
of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't
support cross-shard reference counting. In debug mode it contains
a special ref-counting facility debug_shared_ptr_counter_type,
which resorts to on_internal_error if it detects such a case.

To solve this, we just copy log entries to the target shard if it
isn't equal to the current one. In most cases, if --smp setting
is the same on all nodes, RPC will be handled on zero shard,
so there will be no overhead.
2023-01-03 15:25:00 +03:00
Petr Gusev
1c23390f12 test.py, allow to specify the node's command line in test
An optional parameter cmdline has been added to
the ManagerClient.server_add method.
It allows you to override the default parameters
set by the SCYLLA_CMDLINE_OPTIONS variable
by changing, adding or deleting individual
items. To change or add a parameter just specify
its name and value one after the other.
To remove parameter use the special keyword
__remove__ as a value. To set a parameter
without a value (such as --overprovisioned)
use the special keyword __missing__ as the value.
2023-01-03 15:24:54 +03:00
Nadav Har'El
eb85f136c8 cql-pytest: document how to write new cql-pytest tests
Add to test/cql-pytest/README.md an explanation of the philosophy
of the cql-pytest test suite, and some guideliness on how to write
good tests in that framework.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12400
2023-01-03 12:13:22 +02:00
Anna Stuchlik
994bc33147 docs: fix the command on the Manager-Monitoring Integration troubleshooting page
Closes #12375
2023-01-03 11:41:16 +02:00
Anna Stuchlik
9d17d812c0 docs: Fix https://github.com/scylladb/scylla-doc-issues/issues/870, update the nodetool rebuild command
Closes #12416
2023-01-03 11:40:40 +02:00
Gleb Natapov
1688163233 raft: replace experimental raft option with dedicated flag
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
2023-01-03 11:15:11 +02:00
Gleb Natapov
29060cc235 main: move supervisor notification about group registry start where it actually starts
99fe580068 moved raft_group_registry::start call a bit later, but
forget to move supervisor notification call. Do it now.
2023-01-03 11:09:30 +02:00
Botond Dénes
2ef71e9c70 Merge 'Improve verbosity of task manager api' from Aleksandra Martyniuk
The PR introduces changes to task manager api:
- extends tasks' list returned with get_tasks with task type,
   keyspace, table, entity, and sequence number
- extends status returned with get_task_status and wait_task
   with a list of children's ids

Closes #12338

* github.com:scylladb/scylladb:
  api: extend status in task manager api
  api: extend get_tasks in task manager api
2023-01-03 10:39:41 +02:00
Botond Dénes
82101b786d Merge 'docs: document scylla-api-client' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/11999.

This PR adds a description of scylla-api-cli.

Closes #12392

* github.com:scylladb/scylladb:
  docs: fix the description of the system log POST example
  docs: uptate the curl tool name
  docs: describe how to use the scylla-api-client tool
  docs: fix the scylla-api-client tool name
  docs: document scylla-api-cli
2023-01-03 10:30:04 +02:00
Benny Halevy
63c2cdafe8 sstables: index_reader: close(index_bound&) reset current_list
When closing _lower_bound and *_upper_bound
in the final close() call, they are currently left with
an engaged current_list member.

If the index_reader uses a _local_index_cache,
it is evicted with evict_gently which will, rightfully,
see the respective pages as referenced, and they won't be
evicted gently (only later when the index_reader is destroyed).

Reset index_bound.current_list on close(index_bound&)
to free up the reference.

Ref #12271

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12370
2023-01-02 16:42:33 +01:00
Avi Kivity
767b7be8be Merge 'Get rid of handle_state_replacing' from Benny Halevy
Since [repair: Always use run_replace_ops](2ec1f719de), nodes no longer publish HIBERNATE state so we don't need to support handling it.

Replace is now always done using node operations (using repair or streaming).
so nodes are never expected to change status to HIBERNATE.

Therefore storage_service:handle_state_replacing is not needed anymore.

This series gets rid of it and updates documentation related to STATUS:HIBERNATE respectively.

Fixes #12330

Closes #12349

* github.com:scylladb/scylladb:
  docs: replace-dead-node: get rid of hibernate status
  storage_service: get rid of handle_state_replacing
2023-01-02 13:35:29 +02:00
Gleb Natapov
28952d32ff storage_service: move leave_ring outside of unbootstrap()
We want to reuse the later without the call.

Message-Id: <20221228144944.3299711-17-gleb@scylladb.com>
2023-01-02 12:03:29 +02:00
Gleb Natapov
229cef136d raft: add trace logging to raft::server::start
Allows to see initial state of the server during start.

Message-Id: <20221228144944.3299711-15-gleb@scylladb.com>
2023-01-02 11:57:53 +02:00
Gleb Natapov
96453ff75f service: raft: improve group0_state_machine::apply logging
Trace how many entries are applied as well.

Message-Id: <20221228144944.3299711-14-gleb@scylladb.com>
2023-01-02 11:57:16 +02:00
Gleb Natapov
dbd5b97201 storage_service: improve logging in update_pending_ranges() function
We pass the reason for the change. Log it as well.

Message-Id: <20221228144944.3299711-11-gleb@scylladb.com>
2023-01-02 11:54:03 +02:00
Gleb Natapov
04ab673359 messaging: check that a node knows its own topology before accessing it
We already check is remote's node topology is missing before creating a
connection, but local node topology can be missing too when we will use
raft to manage it. Raft needs to be able to create connections before
topology is knows.

Message-Id: <20221228144944.3299711-7-gleb@scylladb.com>
2023-01-02 11:53:14 +02:00
Gleb Natapov
6f104982e1 topology: use std::erase_if on std::map instead of ad-hoc loop
There is std::erase_if since c++20. We can use it here.

Message-Id: <20221228144944.3299711-6-gleb@scylladb.com>
2023-01-02 11:45:52 +02:00
Gleb Natapov
84eb5924ac system_keyspace: remove redundant include
storage_proxy.hh is included twice

Message-Id: <20221228144944.3299711-4-gleb@scylladb.com>
2023-01-02 11:39:22 +02:00
Gleb Natapov
5182543df2 raft: fix typo in read_barrier logging
The log logs applied index not append one.

Message-Id: <20221228144944.3299711-3-gleb@scylladb.com>
2023-01-02 11:38:47 +02:00
Gleb Natapov
5a96751534 storage_service: remove start_leaving since it is no longer used
Message-Id: <20221228144944.3299711-2-gleb@scylladb.com>
2023-01-02 11:37:48 +02:00
Raphael S. Carvalho
b4e4bbd64a database_test: Reduce x_log2_compaction_group values to avoid timeout
database_test in timing out because it's having to run the tests calling
do_with_cql_env_and_compaction_groups 3x, one for each compaction group
setting. reduce it to 2 settings instead of 3 if running in debug mode.

Refs #12396.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12421
2023-01-01 13:56:18 +02:00
Raphael S. Carvalho' via ScyllaDB development
a7c4a129cb sstables: Bump row_reads metrics for mx version
Metric was always 0 despite a row was processed by mx reader.

Fixes #12406.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20221227220202.295790-1-raphaelsc@scylladb.com>
2022-12-30 18:38:30 +01:00
Anna Stuchlik
601aeb924a docs: remove the auto_bootstrap option from the procedures to create a cluster or add a DC 2022-12-30 13:10:06 +01:00
Avi Kivity
8635d24424 build: drop abseil submodule, replace with distribution abseil
This lets us carry fewer things and rely on the distribution
for maintenance.

The frozen toolchain is updated. Incidental updates include clang 15.0.6,
and pytest that doesn't need workarounds.

Closes #12397
2022-12-28 19:02:23 +02:00
Avi Kivity
eced91b575 Revert "view: coroutinize maybe_mark_view_as_built"
This reverts commit ac2e2f8883. It causes
a regression ("std::bad_variant_access in load_view_build_progress").

Commit 2978052113 (a reindent) is also reverted as part of
the process.

Fixes #12395
2022-12-28 15:36:05 +02:00
Nadav Har'El
200bc82913 test/cql-pytest: exit immediately if Scylla is down
In commit acfa180766 we added to
test/cql-pytest a mechanism to detect when Scylla crashes in the middle
of a test function - in which case we report the culprit test and exit
immediately to avoid having a hundred more tests report that they failed
as well just because Scylla was down.

However, if Scylla was *never* up - e.g., if the user ran "pytest" without
ever running Scylla -  we still report hundreds of tests as having failed,
which is confusing and not helpful.

So with this patch, if a connection cannot be made to Scylla at all,
the test exits immediately, explaining what went wrong, not blaming
any specific test:

    $ pytest
    ...
    ! _pytest.outcomes.Exit: Cannot connect to Scylla at --host=localhost --port=9042 !
    ============================ no tests ran in 0.55s =============================

Beyond being a helpful reminder for a developer who runs "pytest" without
having started Scylla first (or using test/cql-pytest/run or test.py to
start Scylla easily), this patch is also important when running tests
through test.py if it reuses an instance of Scylla that crashed during an
earlier pytest file's run.

This patch does not fix test.py - it can still try to run pytest with
a dead Scylla server without checking. But at least with this patch
pytest will notice this problem immediately and won't report hundreds of
test functions having failed. The only report the user will see will be
the last test which crashed Scylla, which will make it easier to find
this failure without being hidden between hundreds of spurious failures.

Fixes #12360

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12401
2022-12-28 13:04:28 +02:00
Anna Stuchlik
d0db1a27c3 docs: fix the description of the system log POST example 2022-12-28 11:25:54 +01:00
Anna Stuchlik
b7ec99b10b docs: uptate the curl tool name 2022-12-28 10:33:07 +01:00
Asias He
b9e5e340aa streaming: Enable offstrategy for all classic streaming based node ops
This patch enables offstrategy compaction for all classic streaming
based node ops. We can use this method because tables are streamed one
after another. As long as there is still streamed data for a given
table, we update the automatic trigger timer. When all the streaming has
finished, the trigger timer will timeout and fire the offstrategy
compaction for the given table.

I checked with this patch, rebuild is 3X faster. There was no compaction
in the middle of the streaming. The streamed sstables are compacted
together after streaming is done.

Time Before:
INFO  2022-11-25 10:06:08,213 [shard 0] range_streamer - Rebuild
succeeded, took 67 seconds, nr_ranges_remaining=0

Time After:
INFO  2022-11-25 09:42:50,943 [shard 0] range_streamer - Rebuild
succeeded, took 23 seconds, nr_ranges_remaining=0

Compaciton Before:
88 sstables were written -> 88 sstables were added into main set

Compaction After:
88 sstables written ->  after offstretegy 2 sstables were added into main seet

Closes #11848
2022-12-28 11:12:02 +02:00
Michał Chojnowski
5e79d6b30b tasks: task_manager: move invoke_on_task<> to .hh
invoke_on_task is used in translation units where its definition is not
visible, yet it has no explicit instantiations. If the compiler always
decides to inline the definition, not to instantiate it implicitly,
linking invoke_on_task will fail. (It happened to me when I turned up
inline-threshold). Fix that.

Closes #12387
2022-12-28 10:55:43 +02:00
Alejo Sanchez
d408b711e3 test/python: increase CQL connection timeouts
In very slow debug builds the default driver timeouts are too low and
tests might fail. Bump up the values to more reasonable time.

These timeout values are the same as used in topology tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12405
2022-12-28 10:06:33 +02:00
Anna Stuchlik
39ade2f5a5 docs: describe how to use the scylla-api-client tool 2022-12-27 14:46:16 +01:00
Anna Stuchlik
2789501023 docs: fix the scylla-api-client tool name 2022-12-27 14:28:27 +01:00
Alejo Sanchez
1bfe234133 test/pylib: API get/set logger level of Scylla server
Provide helpers to get and set logger level for Scylla servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12394
2022-12-25 13:58:43 +02:00
Anna Stuchlik
ea7e23bf92 docs: fix the option name from compaction to compression on the Data Definition page
Fixes the option name in the "Other table options" table on the Data Definition page.

Fixes #12334

Closes #12382
2022-12-25 11:24:56 +02:00
Botond Dénes
b0d95948e1 mutation_compactor: reset stop flag on page start
When the mutation compactor has all the rows it needs for a page, it
saves the decision to stop in a member flag: _stop.
For single partition queries, the mutation compactor is kept alive
across pages and so it has a method, start_new_page() to reset its state
for the next page. This method didn't clear the _stop flag. This meant
that the value set at the end of the previous could cause the new page
and subsequently the entire query to be stopped prematurely.
This can happen if the new page starts with a row that is covered by a
higher level tombstone and is completely empty after compaction.
Reset the _stop flag in start_new_page() to prevent this.

This commit also adds a unit test which reproduces the bug.

Fixes: #12361

Closes #12384
2022-12-24 13:52:45 +02:00
Takuya ASADA
642d035067 docker: prevent hostname -i failure when server address is specified
On some docker instance configuration, hostname resolution does not
work, so our script will fail on startup because we use hostname -i to
construct cqlshrc.
To prevent the error, we can use --rpc-address or --listen-address
for the address since it should be same.

Fixes #12011

Closes #12115
2022-12-24 13:52:16 +02:00
Asias He
d819d98e78 storage_service: Ignore dropped table for repair_updater
In case a table is dropped, we should ignore it in the repair_updater,
since we can not update off strategy trigger for a dropped table.

Refs #12373

Closes #12388
2022-12-24 13:48:25 +02:00
Raphael S. Carvalho
67ebd70e6e compaction_manager: Fix reactor stalls during periodic submissions
Every 1 hour, compaction manager will submit all registered table_state
for a regular compaction attempt, all without yielding.

This can potentially cause a reactor stall if there are 1000s of table
states, as compaction strategy heuristics will run on behalf of each,
and processing all buckets and picking the best one is not cheap.
This problem can be magnified with compaction groups, as each group
is represented by a table state.

This might appear in dashboard as periodic stalls, every 1h, misleading
the investigator into believing that the problem is caused by a
chronological job.

This is fixed by piggybacking on compaction reevaluation loop which
can yield between each submission attempt if needed.

Fixes #12390.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12391
2022-12-24 13:43:16 +02:00
Anna Stuchlik
74fd776751 docs: document scylla-api-cli 2022-12-23 11:27:37 +01:00
Benny Halevy
8797958dfc schema: operator<<: print also tombstone_gc_options
They are currently missing from the printout
when the a table is created, but they are determinal
to understanding the mode with which tombstones are to
be garbage-collected in the table.  gcGraceSeconds alone
is no longer enough since the introduction of
tombstone_gc_option in a8ad385ecd.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12381
2022-12-22 16:40:18 +02:00
Anna Stuchlik
7e8977bf2d docs: add the info about minor release 2022-12-22 10:26:33 +01:00
Nadav Har'El
ef2e5675ed materialized views, test: add tests for CLUSTERING ORDER BY
In issue #10767, concerned were raised that the CLUSTERING ORDER BY
clause is handled incorrectly in a CREATE MATERIALIZED VIEW definition.

The tests in this patch try to explore the different ways in which
CLUSTERING ORDER BY can be used in CREATE MATERIALIZED VIEW and allows
us to compare Scylla's behaivor to Cassandra, and to common sense.

The tests discover that the CLUSTERING ORDER BY feature in materialized
views generally works as expected, but there are *three* differences
between Scylla and Cassandra in this feature. We consider two differences
to be bugs (and hence the test is marked xfail) and one a Scylla extension:

1. When a base table has a reverse-order clustering column and this
   clustering column is used in the materialized view, in Cassandra
   the view's clustering order inherits the reversed order. In Scylla,
   the view's clustering order reverts to the default order.
   Arguably, both behaviors can be justified, but usually when in doubt
   we should implement Cassandra's behavior - not pick a different
   behavior, even if the different behavior is also reasonable. So
   this test (test_mv_inherit_clustering_order()) is marked "xfail",
   and a new issue was created about this difference: #12308.

   If we want to fix this behavior to match Cassandra's we should also
   consider backward compatibility - what happens if we change this
   behavior in Scylla now, after we had the opposite behavior in
   previous releases? We may choose to enshrine Scylla's Cassandra-
   incompatible behavior here - and document this difference.

2. The CLUSTERING ORDER BY should, as its name suggests, only list
   clustering columns. In Scylla, specifying other things, like regular
   columns, partition-key columns, or non-existent columns, is silently
   ignored, whereas it should result in an Invalid Request error (as it
   does in Cassandra). So test_mv_override_clustering_order_error()
   is marked "xfail".
   This is the difference already discovered in #10767.

3. When a materialized view has several clustering columns, Cassandra
   requires that a CLUSTERING ORDER BY clause, if present, must specify
   the order of all of *all* clustering columns. Scylla, in contrast,
   allows the user to override the order of only *some* of these columns -
   and the rest get the default order. I consider this to be a
   legitimate Scylla extension, and not a compatibility bug, so marked
   the test with "scylla_only", and no issue was opened about it.

Refs #10767
Refs #12308

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12307
2022-12-22 09:48:16 +02:00
Nadav Har'El
6d2e146aa6 test/cql-pytest.py: add scylla_inject_error() utility
This patch adds a scylla_inject_error(), a context manager which tests
can use to temporarily enable some error injection while some test
code is running. It can be used to write tests that artificially
inject certain errors instead of trying to reach the elaborate (and
often requiring precise timing or high amounts of data) situation where
they occur naturally.

The error-injection API is Scylla-specific (it uses the Scylla REST API)
and does not work on "release"-mode builds (all other modes are supported),
so when Cassandra or release-mode build are being tested, the test which
uses scylla_inject_error() gets skipped.

Example usage:

```python
    from rest_api import scylla_inject_error
    with scylla_inject_error(cql, "injection_name", one_shot=True):
        # do something here
        ...
```

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12264
2022-12-22 09:39:10 +02:00
Nadav Har'El
01f0644b22 Merge 'scylla-gdb.py: introduce scylla get-config-value' from Botond Dénes
Retrieves the configuration item with the given name and prints its
value as well as its metadata.
Example:

    (gdb) scylla get-config-value compaction_static_shares
    value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart

Closes #12362

* github.com:scylladb/scylladb:
  scylla-gdb.py: add scylla get-config-value gdb command
  scylla-gdb.py: extract $downcast_vptr logic to standalone method
  test: scylla-gdb/run: improve diagnostics for failed tests
2022-12-21 18:38:23 +02:00
Aleksandra Martyniuk
599fce16cf repair: make top level repair tasks abortable 2022-12-21 11:52:58 +01:00
Aleksandra Martyniuk
e77de463e4 repair: unify a way of aborting repair operations 2022-12-21 11:52:53 +01:00
Aleksandra Martyniuk
f56e886127 repair: delete sharded abort source from node_ops_info
Sharded abort source in node_ops_info is no longer needed since
its functionality is provided by task manager's tasks structure.
2022-12-21 11:37:03 +01:00
Aleksandra Martyniuk
18efe0a4e8 repair: delete unused node_ops_info from data_sync_repair_task_impl 2022-12-21 11:28:30 +01:00
Aleksandra Martyniuk
ee13a5dde8 api: extend status in task manager api
Status of tasks returned with get_task_status and wait_task is extended
with the list of ids of child tasks.
2022-12-21 10:54:56 +01:00
Aleksandra Martyniuk
697af4ccf2 api: extend get_tasks in task manager api
Each task stats in a list returned from tm::get_task api call
is extended with info about: task type, keyspace, table, entity,
and sequence number.
2022-12-21 10:54:50 +01:00
Michał Chojnowski
19049150ef configure.py: remove --static, --pie, --so
These options have been nonsense since 2017.
--pie and --so are ignored, --static disables (sic!) static linking of
libraries.
Remove them.

Closes #12366
2022-12-21 11:01:56 +02:00
Botond Dénes
29d49e829e scylla-gdb.py: add scylla get-config-value gdb command
Retrieves the configuration item with the given name and prints its
value as well as its metadata.
Example:
    (gdb) scylla get-config-value compaction_static_shares
    value: 100, type: "float", source: SettingsFile, status: Used, live: MustRestart
2022-12-21 03:05:56 -05:00
Botond Dénes
0cdb89868a scylla-gdb.py: extract $downcast_vptr logic to standalone method
So it can be reused by regular python code.
2022-12-21 03:05:56 -05:00
Botond Dénes
24022c19a6 test: scylla-gdb/run: improve diagnostics for failed tests
By instructing gdb to print the full python stack in case of errors.
2022-12-21 03:05:56 -05:00
Michał Chojnowski
d9269abf5b sstables: index_reader: always evict the local cache gently
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.

Fixes #12271

Closes #12364
2022-12-20 18:23:27 +02:00
Michał Radwański
e7fbcd6c9d mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes
The consume_in_reverse::legacy_half_reverse format is soon to be phased
out. This commit starts treating frozen_mutations from replicas for
reversed queries so that they are consumed with consume_in_reverse::yes.
2022-12-20 17:05:02 +01:00
Benny Halevy
1adb2bff18 mutation: move consume_in_reverse def to mutation_consumer.hh
To be used also by frozen_mutation consumer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-20 16:23:10 +01:00
Avi Kivity
bb731b4f52 Merge 'docs: move documentation of tools online' from Botond Dénes
Currently the scylla tools (`scylla-types` and `scylla-sstable`) have documentation in two places: high level documentation can be found at `docs/operating-scylla/admin-tools/scylla-{types,sstable}.rst`, while low level, more detailed documentation is embedded in the tool itself. This is especially pronounced for `scylla-sstable`, which only has a short description of its operations online, all details being found only in the command-line help.
We want to move away from this model, such that all documentation can be found online, with the command-line help being reserved to documenting how the various switches and flags work, on top of a short description of the operation and a link to the detailed online docs.

Closes #12284

* github.com:scylladb/scylladb:
  tool/scylla-sstable: move documentation online
  docs: scylla-sstable.rst: add sstable content section
  docs: scylla-{sstable,types}.rst: drop Syntax section
2022-12-20 17:04:47 +02:00
Avi Kivity
3fce43124a Merge 'Static compaction groups' from Raphael "Raph" Carvalho
Allows static configuration of number of compaction groups per table per shard.

To bootstrap the project, config option x_log2_compaction_groups was added which controls both number of groups and partitioning within a shard.

With a value of 0 (default), it means 1 compaction group, therefore all tokens go there.
With a value of 3, it means 8 compaction groups, and 3 most-significant-bits of tokens being used to decide which group owns the token.
And so on.

It's still missing:
- integration with repair / streaming
- integration with reshard / reshape.

perf/perf_simple_query --smp 1 --memory 1G

BEFORE
-----
median 61358.55 tps ( 71.1 allocs/op,  12.2 tasks/op,   56375 insns/op,        0 errors)
median 61322.80 tps ( 71.1 allocs/op,  12.2 tasks/op,   56391 insns/op,        0 errors)
median 61058.58 tps ( 71.1 allocs/op,  12.2 tasks/op,   56386 insns/op,        0 errors)
median 61040.94 tps ( 71.1 allocs/op,  12.2 tasks/op,   56381 insns/op,        0 errors)
median 61118.40 tps ( 71.1 allocs/op,  12.2 tasks/op,   56379 insns/op,        0 errors)

AFTER
-----
median 61656.12 tps ( 71.1 allocs/op,  12.2 tasks/op,   56486 insns/op,        0 errors)
median 61483.29 tps ( 71.1 allocs/op,  12.2 tasks/op,   56495 insns/op,        0 errors)
median 61638.05 tps ( 71.1 allocs/op,  12.2 tasks/op,   56494 insns/op,        0 errors)
median 61726.09 tps ( 71.1 allocs/op,  12.2 tasks/op,   56509 insns/op,        0 errors)
median 61537.55 tps ( 71.1 allocs/op,  12.2 tasks/op,   56491 insns/op,        0 errors)

Closes #12139

* github.com:scylladb/scylladb:
  test: mutation_test: Test multiple compaction groups
  test: database_test: Test multiple compaction groups
  test: database_test: Adapt it to compaction groups
  db: Add config for setting static number of compaction groups
  replica: Introduce static compaction groups
  test: sstable_test: Stop referencing single compaction group
  api: compaction_manager: Stop a compaction type for all groups
  api: Estimate pending tasks on all compaction groups
  api: storage_service: Run maintenance compactions on all compaction groups
  replica: table: Adapt assertion to compaction groups
  replica: database: stop and disable compaction on behalf of all groups
  replica: Introduce table::parallel_foreach_table_state()
  replica: disable auto compaction on behalf of all groups
  replica: table: Rework compaction triggers for compaction groups
  replica: Adapt table::get_sstables_including_compacted_undeleted() to compaction groups
  replica: Adapt table::rebuild_statistics() to compaction groups
  replica: table: Perform major compaction on behalf of all groups
  replica: table: Perform off-strategy compaction on behalf of all groups
  replica: table: Perform cleanup compaction on behalf of all groups
  replica: Extend table::discard_sstables() to operate on all compaction groups
  replica: table: Create compound sstable set for all groups
  replica: table: Set compaction strategy on behalf of all groups
  replica: table: Return min memtable timestamp across all groups
  replica: Adapt table::stop() to compaction groups
  replica: Adapt table::clear() to compaction groups
  replica: Adapt table::can_flush() to compaction groups
  replica: Adapt table::flush() to compaction groups
  replica: Introduce parallel_foreach_compaction_group()
  replica: Adapt table::set_schema() to compaction groups
  replica: Add memtables from all compaction groups for reads
  replica: Add memtable_count() method to compaction_group
  replica: table: Reserve reader list capacity through a callback
  replica: Extract addition of memtables to reader list into a new function
  replica: Adapt table::occupancy() to compaction groups
  replica: Adapt table::active_memtable() to compaction groups
  replica: Introduce table::compaction_groups()
  replica: Preparation for multiple compaction groups
  scylla-gdb: Fix backward compatibility of scylla_memtables command
2022-12-20 17:04:47 +02:00
Avi Kivity
623be22d25 Merge 'sstables: allow bypassing min max position metadata loading' from Botond Dénes
Said mechanism broke tools and tests to some extent: the read it executes on sstable load time means that if the sstable is broken enough to fail this read, it will fail to load, preventing diagnostic tools to load it and examine it and preventing tests from producing broken sstables for testing purposes.

Closes #12359

* github.com:scylladb/scylladb:
  sstables: allow bypassing first/last position metadata loading
  sstables: sstable::{load,open_data}(): fix indentation
  sstables: coroutinize sstable::open_data()
  sstables: sstable::open_data(): use clear_gently() to clear token ranges
  sstables: coroutinize sstable::load()
2022-12-20 17:04:47 +02:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations would be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00
Avi Kivity
88a1fbd72f Update seastar submodule
* seastar 3a5db04197...3db15b5681 (27):
  > build: get the full path of c-ares
  > build: unbreak pkgconfig output
  > http: Add 206 Partial Content response code
  > http: Carry integer content_length on reply
  > tls_test: drop duplicated includes
  > tls_test: remove duplicated test case
  > reactor: define __NR_pidfd_open if not defined
  > sockets: Wait on socket peer closing the connection
  > tcp: Close connection when getting RST from server
  > Merge 'Enhance rpc tester with delays, timeouts and verbosity' from Pavel Emelyanov
  > Merge 'build: use pkg_search_module(.. IMPORTED_TARGET ..) ' from Kefu Chai
  > build: define GnuTLS_{LIBRARIES, INCLUDE_DIRS} only if GnuTLS is found
  > build: use pkg_search_module(.. IMPORTED_TARGET ..)
  > addr2line: extend asan regex
  > abort_source: move-assign operator: call base class unlink
  > coroutine: correct syntax error in doxygen comment
  > demo: Extend http connection demo with https
  > test: temporarily disable warning for tests triggering warnings
  > tests/unit/coroutine: Include <ranges>
  > sstring: Document why sstring exists at all
  > test: log error when read/write to pipe fails
  > test: use executables in /bin
  > tests: spawn_test: use BOOST_CHECK_EQUAL() for checking equality of temporary_buffer
  > docker: bump up to clang {14,15} and gcc {11,12}
  > shared_ptr: ignore false alarm from GCC-12
  > build: check for fix of CWG2631
  > circleci: use versioned container image

Closes #12355
2022-12-20 17:04:47 +02:00
Botond Dénes
3c8949d34c sstables: allow bypassing first/last position metadata loading
When loading an sstable. Tests and tools might want to do this to be
able to load a damaged sstable to do tests/diagnostics on it.
2022-12-20 01:45:38 -05:00
Botond Dénes
bba956c13c sstables: sstable::{load,open_data}(): fix indentation 2022-12-20 01:45:38 -05:00
Botond Dénes
c85ff7945d sstables: coroutinize sstable::open_data()
Used once when sstable is opened on startup, not performance sensitive.
2022-12-20 01:45:38 -05:00
Botond Dénes
15966a0b1b sstables: sstable::open_data(): use clear_gently() to clear token ranges
Instead of an open-coded loop. It also makes the code easier to
coroutinize (next patch).
2022-12-20 01:45:22 -05:00
Nadav Har'El
08c8e0d282 test/alternator: enable tests for long strings of consecutive tombstones
In the past we had issue #7933 where very long strings of consecutive
tombstones caused Alternator's paging to take an unbounded amount of
time and/or memory for a single page. This issue was fixed (by commit
e9cbc9ee85) but the two tests we had
reproducing that issue were left with the "xfail" mark.
They were also marked "veryslow" - each taking about 100 seconds - so
they didn't run by default so nobody noticed they started to pass.

In this patch I make these tests much faster (taking less than a second
together), confirm that they pass - and remove the "xfail" mark and
improve their descriptions.

The trick to making these tests faster is to not create a million
tombstones like we used to: We now know that after string of just 10,000
tombstones ('query_tombstone_page_limit') the page should end, so
we can check specifically this number. The story is more complicated for
partition tombstones, but there too it should be a multiple of
query_tombstone_page_limit. To make the tests even faster, we change
run.py to lower the query_tombstone_page_limit from the default 10,000
to 1000. The tests work correctly even without this change, but they are
ten times faster with it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12350
2022-12-20 07:08:36 +02:00
Botond Dénes
94f3fb341f Merge 'Fix nix devenv' from Michael Livshin
* Update Nixpkgs base

* Clarify some comments

* Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as
  cxx-rs)

* Add missing libraries (libdeflate, libxcrypt)

* Fix expected hash of the gdb patch

* Fix a couple of small build problems

Fixes #12259

Closes #12346

* github.com:scylladb/scylladb:
  build: fix Nix devenv
  cql3: mark several private fields as maybe_unused
  configure.py: link with more abseil libs
2022-12-20 07:01:06 +02:00
Michael Livshin
7c383c6249 build: fix Nix devenv
* Update Nixpkgs base

* Clarify some comments

* Get rid of custom-packaged cxxbridge (it's now present in Nixpkgs as
  cxx-rs)

* Add missing libraries (libdeflate, libxcrypt)

* Fix expected hash of the gdb patch

* Bump Python driver to 3.25.20-scylla

Fixes #12259
2022-12-19 20:53:07 +02:00
Michael Livshin
4407828766 cql3: mark several private fields as maybe_unused
Because they are indeed unused -- they are initialized, passed down
through some layers, but not actually used.  No idea why only Clang 12
in debug mode in Nix devenv complains about it, though.
2022-12-19 20:53:07 +02:00
Michael Livshin
c0c8afb79e configure.py: link with more abseil libs
Specifically libabsl_strings{,_internal}.a.

This fixes failure to link tests in the Nix devenv; since presumably
all is good in other setups, it must be something weird having to do
with inlining?

The extra linked libraries shouldn't hurt in any case.
2022-12-19 20:53:07 +02:00
Raphael S. Carvalho
e7380bea65 test: mutation_test: Test multiple compaction groups
Extends mutation_test to run the tests with more than one
compaction group, in addition to a single one (default).

Piggyback on existing tests. Avoids duplication.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 12:36:07 -03:00
Raphael S. Carvalho
e3e7c3c7e5 test: database_test: Test multiple compaction groups
Extends database_test to run the tests with more than one
compaction group, in addition to a single one (default).

Piggyback on existing tests. Avoids duplication.

Caught a bug when snapshotting, in implementation of
table::can_flush(), showing its usefulness.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 12:36:07 -03:00
Raphael S. Carvalho
e103e41c76 test: database_test: Adapt it to compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 12:36:05 -03:00
Aleksandra Martyniuk
be529cc209 repair: delete redundant abort subscription from shard_repair_task_impl
data_sync_repair_task_impl subscribes to corresponding node_ops_info
abort source and then, when requested, all its descedants are
aborted recursively. Thus, shard_repair_task_impl does not need
to subscribe to the node_ops_info abort source, since the parent
task will take care of aborting once it is requested.

abort_subscription and connected attributes are deleted from
the shard_repair_task_impl.
2022-12-19 16:07:28 +01:00
Aleksandra Martyniuk
e48ca62390 repair: add abort subscription to data sync task
When node operation is aborted, same should happen with
the corresponding task manager's repair task.

Subscribe data_sync_repair_task_impl abort() to node_ops_info
abort_source.
2022-12-19 15:57:35 +01:00
Aleksandra Martyniuk
2b35d7df1b tasks: abort tasks on system shutdown
When system shutdowns, all task manager's top level tasks are aborted.
Responsibility for aborting child tasks is on their parents.
2022-12-19 15:57:35 +01:00
Botond Dénes
827cd0d37b sstables: coroutinize sstable::load()
It nicely simplified by it. No regression expected, this method is
supposedly only used by tests and tools.
2022-12-19 09:33:52 -05:00
Raphael S. Carvalho
d9ab59043e db: Add config for setting static number of compaction groups
This new option allows user to control the number of compaction groups
per table per shard. It's 0 by default which implies a single compaction
group, as is today.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:24 -03:00
Raphael S. Carvalho
9cf4dc7b62 replica: Introduce static compaction groups
This is the initial support for multiple groups.

_x_log2_compaction_groups controls the number of compaction groups
and the partitioning strategy within a single table.

The value in _x_log2_compaction_groups refers to log base 2 of the
actual number of groups.

0 means 1 compaction group.
1 means 2 groups and 2 most significant bits of token being
used to pick the target group.

The group partitioner should be later abstracted for making tablet
integration easier in the future.

_x_log2_compaction_groups is still a constant but a config option
will come next.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:23 -03:00
Raphael S. Carvalho
c807e61715 test: sstable_test: Stop referencing single compaction group
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:20 -03:00
Raphael S. Carvalho
254c38c4d2 api: compaction_manager: Stop a compaction type for all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:19 -03:00
Raphael S. Carvalho
4e836cb96c api: Estimate pending tasks on all compaction groups
Estimates # of compaction jobs to be performed on a table.
Adaptation is done by adding estimation from all groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:17 -03:00
Raphael S. Carvalho
640436e72a api: storage_service: Run maintenance compactions on all compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:15 -03:00
Raphael S. Carvalho
e0c5cbee8d replica: table: Adapt assertion to compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:13 -03:00
Raphael S. Carvalho
d35cf88f09 replica: database: stop and disable compaction on behalf of all groups
With compaction group model, truncate_table_on_all_shards() needs
to stop and disable compaction for all groups.
replica::table::as_table_state() will be removed once no user
remains, as each table may map to multiple groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:12 -03:00
Raphael S. Carvalho
50b02ee0bd replica: Introduce table::parallel_foreach_table_state()
This will replace table::as_table_state(). The latter will be
killed once its usage drops to zero.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:10 -03:00
Raphael S. Carvalho
fd69bd433e replica: disable auto compaction on behalf of all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:08 -03:00
Raphael S. Carvalho
6fefbe5706 replica: table: Rework compaction triggers for compaction groups
Allow table-wide compaction trigger, as well as fine-grained trigger
like after flushing a memtable on behalf of a single group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:07 -03:00
Raphael S. Carvalho
6a6adea3ab replica: Adapt table::get_sstables_including_compacted_undeleted() to compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:05 -03:00
Raphael S. Carvalho
5919836da8 replica: Adapt table::rebuild_statistics() to compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:04 -03:00
Raphael S. Carvalho
70b727db31 replica: table: Perform major compaction on behalf of all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:01 -03:00
Raphael S. Carvalho
e3ccdb17a0 replica: table: Perform off-strategy compaction on behalf of all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:16:00 -03:00
Raphael S. Carvalho
6efc9fd1f6 replica: table: Perform cleanup compaction on behalf of all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:58 -03:00
Raphael S. Carvalho
36e11eb2a5 replica: Extend table::discard_sstables() to operate on all compaction groups
discard_sstables() runs on context of truncate, which is a table-wide
operation today, and will remain so with multiple static groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:55 -03:00
Raphael S. Carvalho
24c3687c3f replica: table: Create compound sstable set for all groups
Avoids extra compound set for single-compaction-group table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:52 -03:00
Raphael S. Carvalho
eb620da981 replica: table: Set compaction strategy on behalf of all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:50 -03:00
Raphael S. Carvalho
7a0e4f900f replica: table: Return min memtable timestamp across all groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:49 -03:00
Raphael S. Carvalho
ceaa8a1ef1 replica: Adapt table::stop() to compaction groups
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:47 -03:00
Raphael S. Carvalho
facf923440 replica: Adapt table::clear() to compaction groups
clear() clears memtable content and cache.

Cache is shared by groups, therefore adaptation happens by only
clearing memtables of all groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:45 -03:00
Raphael S. Carvalho
a9c902cd5e replica: Adapt table::can_flush() to compaction groups
can_flush() is used externally to determine if a table has an active
memtable that can be flushed. Therefore, adaptation happens by
returning true if any of the groups can be flushed. A subsequent
flush request will flush memtable of all groups that are ready
for it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:44 -03:00
Raphael S. Carvalho
ea42090d47 replica: Adapt table::flush() to compaction groups
Adaptation of flush() happens by trigger flush on memtable of all
groups.
table::seal_active_memtable() will bail out if memtable is empty, so
it's not a problem to call flush on a group which memtable is empty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:42 -03:00
Raphael S. Carvalho
7274c83098 replica: Introduce parallel_foreach_compaction_group()
This variant will be useful when iterating through groups
and performing async actions on each. It guarantees that all
groups are alive by the time they're reached in the loop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:40 -03:00
Raphael S. Carvalho
89ab9d7227 replica: Adapt table::set_schema() to compaction groups
set_schema() is used by the database to apply schema changes to
table components which include memtables.
Adaptation happens by setting schema to memtable(s) of all groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:38 -03:00
Raphael S. Carvalho
0022322ae3 replica: Add memtables from all compaction groups for reads
Let's add memtables of all compaction groups. Point queries are
optimized by picking a single group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:36 -03:00
Raphael S. Carvalho
e044001176 replica: Add memtable_count() method to compaction_group
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:34 -03:00
Raphael S. Carvalho
f2ea79f26c replica: table: Reserve reader list capacity through a callback
add_memtables_to_reader_list() will be adapted to compaction groups.
For point queries, it will add memtables of a single group.
With the callback, add_memtables_to_reader_list() can tell its
caller the exact amount of memtable readers to be added, so it
can reserve precisely the readers capacity.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:33 -03:00
Raphael S. Carvalho
e841508685 replica: Extract addition of memtables to reader list into a new function
Will make it easier for adding memtables of all compaction groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:19 -03:00
Raphael S. Carvalho
530956b2de replica: Adapt table::occupancy() to compaction groups
table::occupancy() provides accumulated occupancy stats from
memtables.
Adaptation happens by accumulating stats from memtables of
all groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:17 -03:00
Raphael S. Carvalho
ef8f542d75 replica: Adapt table::active_memtable() to compaction groups
active_memtable() was fine to a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:14 -03:00
Raphael S. Carvalho
429c5aa2f9 replica: Introduce table::compaction_groups()
Useful for iterating through all groups. This is intermediary
implementation which requires allocation as only one group
is supported today.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:12 -03:00
Raphael S. Carvalho
514008f136 replica: Preparation for multiple compaction groups
Adjusts scylla_memtables gdb command to multiple groups,
while keeping backward compatibility.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:10 -03:00
Raphael S. Carvalho
52b94b6dd7 scylla-gdb: Fix backward compatibility of scylla_memtables command
Fix it while refactoring the code for arrival of multiple compaction
groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:07 -03:00
Anna Stuchlik
bbfb9556fc doc: mark the in-memory tables feature as deprecated
Closes #12286
2022-12-19 15:39:31 +02:00
Avi Kivity
c70a9b0166 test: make test xml filenames more unique
ea99750de7 ("test: give tests less-unique identifiers") made
the disambiguating ids only be unambiguous within a single test
case. This made all tests named "run" have the name name "run.1".

Fix that by adding the suite name everywhere: in test paths, and
in junit test case names.

Fixes #12310.

Closes #12313
2022-12-19 15:03:51 +02:00
Botond Dénes
3e6ddf21bc Merge 'storage_service: unbootstrap: avoid unnecessary copy of ranges_to_stream' from Benny Halevy
`ranges_to_stream` is a map of ` std::unordered_multimap<dht::token_range, inet_address>` per keyspace.
On large clusters with a large number of keyspace, copying it may cause reactor stalls as seen in #12332

This series eliminates this copy by using std::move and also
turns `stream_ranges` into a coroutine, adding maybe_yield calls to avoid further stalls down the road.

Fixes #12332

Closes #12343

* github.com:scylladb/scylladb:
  storage_service: stream_ranges: unshare streamer
  storage_service: stream_ranges: maybe_yield
  storage_service: coroutinize stream_ranges
  storage_service: unbootstrap: move ranges_to_stream_by_keyspace to stream_ranges
2022-12-19 12:53:16 +02:00
Benny Halevy
e8aa1182b2 docs: replace-dead-node: get rid of hibernate status
With replace using node operations, the HIBERNATE
gossip status is not used anymore.

This change updates documentation to reflect that.
During replace, the replacing nodes shows in gossipinfo
in STATUS:NORMAL.

Also, the replaced node shows as DN in `nodetool status`
while being replaced, so remove paragraph showing it's
not listed in `nodetool status`.

Plus. tidy up the text alignment.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 12:19:10 +02:00
Benny Halevy
c9993f020d storage_service: get rid of handle_state_replacing
Since 2ec1f719de nodes no longer
publish HIBERNATE state so we don't need to support handling it.

Replace is now always done using node operations (using
repair or streaming).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 12:19:08 +02:00
Benny Halevy
60de7d28db storage_service: stream_ranges: unshare streamer
Now that stream_ranges is a coroutine
streamer can be an automatic variable on the
coroutine stack frame.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 07:42:07 +02:00
Benny Halevy
9badcd56ca storage_service: stream_ranges: maybe_yield
Prevent stalls with a large number of keyspaces
and token ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 07:42:07 +02:00
Benny Halevy
2cf75319b0 storage_service: coroutinize stream_ranges
Before adding maybe_yield calls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 07:42:01 +02:00
Benny Halevy
82486bb5d2 storage_service: unbootstrap: move ranges_to_stream_by_keyspace to stream_ranges
Avoid a potentially large memory copy causing
a reactor stall with a large number of keyspaces.

Fixes #12332

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-19 07:39:48 +02:00
Avi Kivity
7c7eb81a66 Merge 'Encapsulate filesystem access by sstable into filesystem_storage subsclass' from Pavel Emelyanov
This is to define the API sstable needs from underlying storage. When implementing object-storage backend it will need to implement those. The API looks like

        future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const;
        future<> quarantine(const sstable& sst, delayed_commit_changes* delay);
        future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay);
        void open(sstable& sst, const io_priority_class& pc); // runs in async context
        future<> wipe(const sstable& sst) noexcept;

        future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity);

It doesn't have "list" or alike, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as separate PR.

Closes #12217

* github.com:scylladb/scylladb:
  sstable, storage: Mark dir/temp_dir private
  sstable: Remove get_dir() (well, almost)
  sstable: Add quarantine() method to storage
  sstable: Use absolute/relative path marking for snapshot()
  sstable: Remove temp_... stuff from sstable
  sstable: Move open_component() on storage
  sstable: Mark rename_new_sstable_component_file() const
  sstable: Print filename(type) on open-component error
  sstable: Reorganize new_sstable_component_file()
  sstable: Mark filename() private
  sstable: Introduce index_filename()
  tests: Disclosure private filename() calls
  sstable: Move wipe_storage() on storage
  sstable: Remove temp dir in wipe_storage()
  sstable: Move unlink parts into wipe_storage
  sstable: Remove get_temp_dir()
  sstable: Move write_toc() to storage
  sstable: Shuffle open_sstable()
  sstable: Move touch_temp_dir() to storage
  sstable: Move move() to storage
  sstable: Move create_links() to storage
  sstable: Move seal_sstable() to storage
  sstable: Tossing internals of seal_sstable()
  sstable: Move remove_temp_dir() to storage
  sstable: Move create_links_common() to storage
  sstable: Move check_create_links_replay() to storage
  sstable: Remove one of create_links() overloads
  sstable: Remove create_links_and_mark_for_removal()
  sstable: Indentation fix after prevuous patch
  sstable: Coroutinize create_links_common()
  sstable: Rename create_links_common()'s "dir" argument
  sstable: Make mark_for_removal bool_class
  sstable, table: Add sstable::snapshot() and use in table::take_snapshot
  sstable: Move _dir and _temp_dir on filesystem_storage
  sstable: Use sync_directory() method
  test, sstable: Use component_basename in test
  sstables: Move read_{digest|checksum} on sstable
2022-12-18 17:29:35 +02:00
Anna Stuchlik
6a8eb33284 docs: add the new upgade guide 2022.1 to 2022.2 to the index and the toctree 2022-12-16 17:13:50 +01:00
Anna Stuchlik
36f4ef2446 docs: add the index file for the new upgrage guide from 2022.1 to 2022.2 2022-12-16 17:11:25 +01:00
Anna Stuchlik
8d8983e029 docs: add the metrics update file to the upgrade guide 2022.1 to 2022.2 2022-12-16 17:09:21 +01:00
Anna Stuchlik
252c2139c2 docs: add the upgrade guide for ScyllaDB Enterprise from 2022.1 to 2022.2 2022-12-16 17:07:00 +01:00
Michał Chojnowski
b52bd9ef6a db: commitlog: remove unused max_active_writes()
Dead and misleading code.

Closes #12327
2022-12-16 10:23:03 +02:00
Nadav Har'El
327539b15d Merge 'test.py: fix cql failure handling' from Alecco
Fix a bug in failure handling and log level.

Closes #12336

* github.com:scylladb/scylladb:
  test.py: convert param to str
  test.py: fix error level for CQL tests
2022-12-16 09:29:21 +02:00
Botond Dénes
cc03becf82 Merge 'tasks: get task's type with method' from Aleksandra Martyniuk
Type of operation is related to a specific implementation
of a task. Then, it should rather be access with a virtual
method in tasks::task_manager::task::impl than be
its attribute.

Closes #12326

* github.com:scylladb/scylladb:
  api: delete unused type parameter from task_manager_test api
  tasks: repair: api: remove type attribute from task_manager::task::status
  tasks: add type() method to task_manager::task::impl
  repair: add reason attribute to repair_task
2022-12-16 09:20:26 +02:00
Aleksandra Martyniuk
f81ad2d66a repair: make shard tasks internal
Shard tasks should not be visible to users by default, thus they are
made internal.

Closes #12325
2022-12-16 09:05:30 +02:00
Aleksandra Martyniuk
bae887da3b tasks: add virtual destructor to task_manager::module
When an object of a class inheriting from task_manager::module
is destroyed, destructor of the derived class should be called.

Closes #12324
2022-12-16 08:59:26 +02:00
Raphael S. Carvalho
e6fb3b3a75 compaction: Delete atomically off-strategy input sstables
After commit a57724e711, off-strategy no longer races with view
building, therefore deletion code can be simplified and piggyback
on mechanism for deleting all sstables atomically, meaning a crash
midway won't result in some of the files coming back to life,
which leads to unnecessary work on restart.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12245
2022-12-16 08:15:49 +02:00
Alejo Sanchez
9b65448d38 test.py: convert param to str
The format_unidiff() function takes str, not pathlib PosixPath, so
convert it to str.

This prevented diff output of unexpected result to be shown in the log
file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-12-15 20:46:35 +01:00
Alejo Sanchez
5142d80bb1 test.py: fix error level for CQL tests
If the test fails, use error log level.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-12-15 20:45:44 +01:00
Botond Dénes
64903ba7d5 test/cql-pytest: use pytest site-packages workaround
Recently, the pytest script shipped by Fedora started invoking python
with the `-s` flag, which disables python considering user site
packages. This caused problems for our tests which install the cassandra
driver in the user site packages. This was worked around in e5e7780f32
by providing our own pytest interposer launcher script which does not
pass the above mentioned flag to python. Said patch fixed test.py but
not the run.py in cql-pytest. So if the cql-pytest suite is ran via
test.py it works fine, but if it is invoked via the run script, it fails
because it cannot find the cassandra driver. This patch patches run.py
to use our own pytest launcher script, so the suite can be run via the
run script as well.
Since run.py is shared with the alternator pytest suite, this patch also
fixes said test suite too.

Closes #12253
2022-12-15 16:05:31 +02:00
Benny Halevy
639e247734 test: cql-pytest: test_describe: test_table_options_quoting: USE test_keyspace
Without that, I often (but not always) get the following error:
```
__________________________ test_table_options_quoting __________________________

cql = <cassandra.cluster.Session object at 0x7f1aafb10650>
test_keyspace = 'cql_test_1671103335055'

    def test_table_options_quoting(cql, test_keyspace):
        type_name = f"some_udt; DROP KEYSPACE {test_keyspace}"
        column_name = "col''umn -- @quoting test!!"
        comment = "table''s comment test!\"; DESC TABLES --quoting test"
        comment_plain = "table's comment test!\"; DESC TABLES --quoting test" #without doubling "'" inside comment

>       cql.execute(f"CREATE TYPE \"{type_name}\" (a int)")

test/cql-pytest/test_describe.py:623:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cassandra/cluster.py:2699: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"
```

CQL driver in use ise the scylla driver version 3.25.10.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12329
2022-12-15 14:35:33 +02:00
Aleksandra Martyniuk
f0b2b00a15 api: delete unused type parameter from task_manager_test api 2022-12-15 10:50:30 +01:00
Aleksandra Martyniuk
5bc09daa7a tasks: repair: api: remove type attribute from task_manager::task::status 2022-12-15 10:49:09 +01:00
Aleksandra Martyniuk
8d5377932d tasks: add type() method to task_manager::task::impl 2022-12-15 10:41:58 +01:00
Aleksandra Martyniuk
329176c7bc repair: add reason attribute to repair_task
As a preparation to creating a type() method in task_manager::task::impl
a streaming::stream_reason is kept in repair_task.
2022-12-15 10:38:38 +01:00
Botond Dénes
9713a5c314 tool/scylla-sstable: move documentation online
The inline-help of operations will only contain a short summary of the
operation and the link to the online documentation.
The move is not a straightforward copy-paste. First and foremost because
we move from simple markdown to RST. Informal references are also
replaced with proper RST links. Some small edits were also done on the
texts.
The intent is the following:
* the inline help serves as a quick reference for what the operation
  does and what flags it has;
* the online documentation serves as the full reference manual,
  explaining all details;
2022-12-15 04:10:21 -05:00
Botond Dénes
3cf7afdf95 docs: scylla-sstable.rst: add sstable content section
Provides a link to the architecture/sstable page for more details on the
sstable format itself. It also describes the mutation-fragment stream,
the parts of it that is relevant to the sstable operations.
The purpose of this section is to provide a target for links that want to
point to a common explanation on the topic. In particular, we will soon
move the detailed documentation of the scylla-sstable operations into
this file and we want to have a common explanation of the mutation
fragment stream that these operations can point to.
2022-12-15 04:10:21 -05:00
Botond Dénes
641fb4c8bb docs: scylla-{sstable,types}.rst: drop Syntax section
In both files, the section hierarchy is as follows:

    Usage
        Syntax
            Sections with actual content

This scheme uses up 3 levels of hierarchy, leaving not much room to
expand the sections with actual content with subsections of their own.
Remove the Syntax level altogether, directly embedding the sections with
content under the Usage section.
2022-12-15 04:03:00 -05:00
Botond Dénes
8f8284783a Merge 'Fix handling of non-full clustering keys in the read path' from Tomasz Grabiec
This PR fixes several bugs related to handling of non-full
clustering keys.

One is in trim_clustering_row_ranges_to(), which is broken for non-full keys in reverse
mode. It will trim the range to position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.

Fixes #12180

after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.

This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.

It probably already causes problems for such keys as after_key() is used
in various parts in the read path.

Refs #1446

Closes #12234

* github.com:scylladb/scylladb:
  position_in_partition: Make after_key() work with non-full keys
  position_in_partition: Introduce before_key(position_in_partition_view)
  db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
  types: Fix comparison of frozen sets with empty values
2022-12-15 10:47:12 +02:00
Pavel Emelyanov
6d10a3448b sstable, storage: Mark dir/temp_dir private
Now all storage access via sstable happens with the help of storage
class API so its internals can be finally made private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
6296ca3438 sstable: Remove get_dir() (well, almost)
The sstable::get_dir() is now gone, no callers know that sstable lives
in any path on a filesystem. There are only few callers left.

One is several places in code that need sstable datafile, toc and index
paths to print them in logs. The other one is sstable_directory that is
to be patched separately.

For both there's a storage.prefix() method that prepends component name
with where the sstable is "really" located.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
7402787d16 sstable: Add quarantine() method to storage
Moving sstable to quarantine has some specific -- if the sstable is in
staging/ directory it's anyway moved into root/quarantine dir, not into
the quarantine subdir of its current location.

Encapsulate this feture in storage class method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
f507271578 sstable: Use absolute/relative path marking for snapshot()
The snapshotting code uses full paths to files to manipulate snapshotted
sstables. Until this code is patched to use some proper snapshotting API
from sstable/ module, it will continue doing so.

Nowever, to remove the get_dir() method from sstable() the
seal_sstable() needs to put relative "backup" directory to
storage::snapshot() method. This patch adds a temporary bool_class for
this distinguishing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
a46d378bee sstable: Remove temp_... stuff from sstable
There's a bunch of helpers around XFS-specific temp-dir sitting in
publie sstable part. Drop it altogether, no code needs it for real.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
adba24d8ae sstable: Move open_component() on storage
Obtaining a class file object to read/write sstable from/to is now
storage-specific.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
4c22831d23 sstable: Mark rename_new_sstable_component_file() const
It's in fact such. Next patch will need it const to call this method
via const sstable reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
6bf3e3a921 sstable: Print filename(type) on open-component error
The file path is going to disappear soon, so print the filename() on
error. For now it's the same, but the meaning of the filename()
returning string is changing to become "random label for the log
reader".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
dc72bce6d7 sstable: Reorganize new_sstable_component_file()
The helper consists of three stages:

1. open a file (probably in a temp dir)
2. decorate it with extentions and checked_file
3. optionally rename a file from temp dir

The latter is done to trigger XFS allocate this file in separate block
group if the file was created in temp dir on step 1.

This patch swaps steps 2 and 3 to keep filesystem-specific opening next
to each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
e55c740f49 sstable: Mark filename() private
From now on no callers should use this string to access anything on disk

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
5f579eb405 sstable: Introduce index_filename()
Currently the sstable::filename(Index) is used in several places that
get the filename as a printable or throwable string and don't treat is
as a real location of any file.

For those, add the index_filename() helper symmetrical to toc_filename()
and (in some sense) the get_filename() one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
bbbbd6dbfc tests: Disclosure private filename() calls
The sstable::filename() is going to become private method. Lots of tests
call it, but tests do call a lot of other sstable private methods,
that's OK. Make the sstable::filename() yet another one of that kind in
advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
4a91f3d443 sstable: Move wipe_storage() on storage
Now when the filesystem cleaning code is sitting in one method, it can
finally be made the storage class one.

Exception-safe allocation of toc_name (spoiler: it's copied anyway one
step later, so it's "not that safe" actually) is moved into storage as
well. The caller is left with toc_filename() call in its exception
handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
c92d45eaa9 sstable: Remove temp dir in wipe_storage()
When unlinking an sstable for whatever reason it's good to check if the
temp dir is handing around. In some cases it's not (compaction), but
keeping the whole wiping code together makes it easier to move it on
storage class in one go.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
88ede71320 sstable: Move unlink parts into wipe_storage
Just move the code. This is to make the next patch smaller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
0336cb3bdd sstable: Remove get_temp_dir()
Only one private called of it left, it's better to open-code it there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
3326063b8b sstable: Move write_toc() to storage
This method initiates the sstable creation. Effectively it's the first
step in sstable creation transaction implemented on top of rename()
call. Thus this method is moved onto storage under respective name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
636d49f1c1 sstable: Shuffle open_sstable()
When an sstable is prepared to be written on disk the .write_toc() is
called on it which created temporary toc file. Prior to this, the writer
code calls generate_toc() to collect components on the sstable.

This patch adds the .open_sstable() API call that does both. This
prepares the write_toc() part to be moved to storage, because it's not
just "write data into TOC file", it's the first step in transaction
implemeted on top of rename()s.

The test need care -- there's rewrite_toc_without_scylla_component()
thing in utils that doesn't want the generate_toc() part to be called.
It's not patched here and continues calling .write_toc().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
d3216b10d6 sstable: Move touch_temp_dir() to storage
The continuation of the previously moved remove_temp_dir() one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Pavel Emelyanov
1a34cb98fc sstable: Move move() to storage
The sstable can be "moved" in two cases -- to move from staging or to
move to quarantine. Both operation are sstable API ones, but the
implementation is storage-specific. This patch makes the latter a method
of storage class.

One thing to note is that only quarantine() touched the target directly.
Now also the move_to_new_dir() happenning on load also does it, but
that's harmless.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:47 +03:00
Pavel Emelyanov
18f6165993 sstable: Move create_links() to storage
This method is currently used in two places: sstable::snapshot() and
sstable::seal_sstable(). The latter additionally touches the target
backup/ subdir.

This patch moves the whole thing on storage and adds touch for all the
cases. For snapshots this might be excessive, but harmless.

Tests get their private-disclosure way to access sstable._storage in
few places to call create_links directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
136a8681e0 sstable: Move seal_sstable() to storage
Now the sstable sealing is split into storage part, internal-state part
and the seal-with-backup kick.

This move makes remove_temp_dir() private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
334d231f56 sstable: Tossing internals of seal_sstable()
There are two of them -- one API call and the other one that just
"seals" it. The latter one also changes the _marked_for_deletion bit on
the sstable.

This patch makes the latter method prepared to be moved onto storage,
because sealing means comitting TOC file on disk with the help of rename
system call which is purely storage thing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
ce3a8a4109 sstable: Move remove_temp_dir() to storage
This one is simple, it just accesses _temp_dir thing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
9027d137d2 sstable: Move create_links_common() to storage
Same as previous patch. This move makes the previously moved
check_create_links_replay() a private method of the storage class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
990032b988 sstable: Move check_create_links_replay() to storage
It needs to get sstable const reference to get the filename(s) from it.
Other than that it's pure filesystem-accessing method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
041a8c80ad sstable: Remove one of create_links() overloads
There are two -- one that accepts generation and the other one that does
not. The latter is only called by the former, so no need in keeping both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
f1558b6988 sstable: Remove create_links_and_mark_for_removal()
There's only one user of it, it can document its "and mark for removal"
intention via dedicated bool_class argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
65f40b28e6 sstable: Indentation fix after prevuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
428adda4a9 sstable: Coroutinize create_links_common()
Looks much shorter and easier-to-patch this way.

The dst_dir argument is made value from const reference, old code copied
it with do_with() anyway.

Indentation is deliberately left broken until next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
ab13a99586 sstable: Rename create_links_common()'s "dir" argument
The whole method is going to move onto newly introduced
filesystem_storage that already has field of the same name onboard. To
avoid confusion, rename the argument to dst_dir.

No functional changes, _just_ s/dir/dst_dir/g throughout the method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
4977c73163 sstable: Make mark_for_removal bool_class
Its meaning is comment-documented anyway. Also, next patches will remove
the create_links_and_mark_for_removal() so callers need some verbose
meaning of this boolean in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:45 +03:00
Pavel Emelyanov
f53d6804a6 sstable, table: Add sstable::snapshot() and use in table::take_snapshot
The replica/ code now "knows" that snapshotting an sstable means
creating a bunch of hard-links on disk. Abstract that via
sstable::snapshot() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:44 +03:00
Pavel Emelyanov
2803dcda6d sstable: Move _dir and _temp_dir on filesystem_storage
Those two fields define the way sstable is stored as collection of
on-disk files. First step towards making the storage access abstract is
in moving the paths onto filesystem_storage embedded class.

Both are made public for now, the rest of the code is patched to access
them via _storage.<smth>. The rest of the set moves parts of sstable::
methods into the filesystem_storage, then marks the paths private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:44 +03:00
Pavel Emelyanov
17c8ba6034 sstable: Use sync_directory() method
The sstable::write_toc() executes sync_directory() by hand. Better to
use the method directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:44 +03:00
Pavel Emelyanov
e934f42402 test, sstable: Use component_basename in test
One case gets full sstable datafile path to get the basename from it.
There's already the basename helper on the class sstable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:44 +03:00
Pavel Emelyanov
376915d406 sstables: Move read_{digest|checksum} on sstable
These methods access sstables as files on disk, in order to hide the
"path on filesystem" meaning of sstables::filename() the whole method
should be made sstable:: one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:13:44 +03:00
Pavel Emelyanov
d561495f0d Merge 'topology: get rid of pending state' from Benny Halevy
Now, with a44ca06906, is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.

With that we can get rid of this state and just keep all endpoints we know about in the topology.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12294

* github.com:scylladb/scylladb:
  topology: get rid of pending state
  topology: debug log update and remove endpoint
2022-12-14 19:28:35 +03:00
Benny Halevy
bdb6550305 view: row_locker: add latency_stats_tracker
Refactor the existing stats tracking and updating
code into struct latency_stats_tracker and while at it,
count lock_acquisitions only on success.

Decrement operations_currently_waiting_for_lock in the destructor
so it's always balanced with the uncoditional increment
in the ctor.

As for updating estimated_waiting_for_lock, it is always
updated in the dtor, both on success and failure since
the wait for the lock happened, whether waiting
timed out or not.

Fixes #12190

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12225
2022-12-14 17:37:22 +02:00
Avi Kivity
9ee78975b7 Merge 'Fix topology mismatch on read-repair handler creation' from Pavel Emelyanov
The schedule_repair() receives a bunch of endpoint:mutations pairs and tries to create handlers for those. When creating the handlers it re-obtains topology from schema->ks->effective_replication_map chain, but this new topology can be outdated as compared to the list of endpoints at hand.

The fix is to carry the e.r.m. pointer used by read executor reconciliation all the way down to repair handlers creation. This requires some manipulations with mutate_internal() and mutate_prepare() argument lists.

fixes: #12050 (it was the same problem)

Closes #12256

* github.com:scylladb/scylladb:
  proxy: Carry replication map with repair mutation(s)
  proxy: Wrap read repair entries into read_repair_mutation
  proxy: Turn ref to forwardable ref in mutations iterator
2022-12-14 17:33:43 +02:00
Tomasz Grabiec
23e4c83155 position_in_partition: Make after_key() work with non-full keys
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.

after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.

This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.

It probably already causes problems for such keys.
2022-12-14 14:47:33 +01:00
Botond Dénes
16c50bed5e Merge 'sstables: coroutinize update_info_for_opened_data' from Avi Kivity
A complicated function (in continuation style) that benefits
from this simplification.

Closes #12289

* github.com:scylladb/scylladb:
  sstables: update_info_for_opened_data: reindent
  sstables: update_info_for_opened_data: coroutinize
2022-12-14 15:12:22 +02:00
Nadav Har'El
92d03be37b materialized view: fix bug in some large modifications to base partitions
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.

To avoid large allocations, we split the large amount of work into
batch of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafed large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.

The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.

The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.

Fixes #12297.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12305
2022-12-14 14:50:38 +02:00
Botond Dénes
e7d8855675 Merge 'Revert accidental submodule updates' from Benny Halevy
The abseil and tools/java submodules were accidentally updated in
71bc12eecc
(merged to master in 51f867339e)

This series reverts those changes.

Closes #12311

* github.com:scylladb/scylladb:
  Revert accidental update of tools/java submodule
  Revert accidental update of abseil submodule
2022-12-14 13:20:08 +02:00
Benny Halevy
865193f99a Revert accidental update of tools/java submodule
The tools/java submodule was accidentally updated
in 71bc12eecc
Revert this change.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-14 13:06:30 +02:00
Benny Halevy
9911ba195b Revert accidental update of abseil submodule
The abseil module was accidentally updated
in 71bc12eecc
Revert this change.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-14 13:05:04 +02:00
Pavel Emelyanov
ab8fc0e166 proxy: Carry replication map with repair mutation(s)
The create_write_response_handler() for read repair needs the e.r.m.
from the caller, because it effectively accepts list of endpoints from
it.

So this patch equips all read_repair_mutation-s with the e.r.m. pointer
so that the handler creation can use it. It's the same for all
mutations, so it's a waste of space, but it's not bad -- there's
typically few mutations in this range and the entry passed there is
temporary, so even lots of them won't occupy lots of memory for long.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-14 14:03:39 +03:00
Pavel Emelyanov
140f373e15 proxy: Wrap read repair entries into read_repair_mutation
The schedule_repair() operates on a map of endpoint:mutations pairs.
Next patch will need to extend this entry and it's going to be easier if
the entry is wrapped in a helper structure in advance.

This is where the forwardable reference cursor from the previous patch
gets its user. The schedule_repair() produces a range of rvalue
wrappers, but the create_write_response_handler accepting it is OK, it
copies mutations anyway.

The printing operator is added to facilitate mutations logging from
mutate_internal() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-14 14:01:12 +03:00
Pavel Emelyanov
014b563ef1 proxy: Turn ref to forwardable ref in mutations iterator
The mutate_prepare() is iterating over range of mutation with 'auto&'
cursor thus accepting only lvalues. This is very restrictive, the caller
of mutate_prepare() may as well provide rvalues if the target
create_write_response_handler() or lambda accepts it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-14 14:00:10 +03:00
Avi Kivity
3fa230fee4 Merge 'cql3: expr: make it possible to prepare and evaluate conjunctions' from Jan Ciołek
This PR implements two things:
* Getting the value of a conjunction of elements separated by `AND` using `expr::evaluate`
* Preparing conjunctions using `prepare_expression`

---

`NULL` is treated as an "unkown value" - maybe `true` maybe `false`.
`TRUE AND NULL` evaluates to `NULL` because it might be `true` but also might be `false`.
`FALSE AND NULL` evaluates to `FALSE` because no matter what value `NULL` acts as, the result will still be `FALSE`.
Unset and empty values are not allowed.

Usually in CQL the rule is that when `NULL` occurs in an operation the whole expression becomes `NULL`, but here we decided to deviate from this behavior.
Treating `NULL` as an "unkown value" is the standard SQL way of handing `NULLs` in conjunctions.
It works this way in MySQL and Postgres so we do it this way as well.

The evaluation short-circuits. Once `FALSE` is encountered the function returns `FALSE` immediately without evaluating any further elements.
It works this way in Postgres as well, for example:
`SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error,
 but `SELECT false AND 1/0 = 0` will successfully evaluate to `FALSE`.

Closes #12300

* github.com:scylladb/scylladb:
  expr_test: add unit tests for prepare_expression(conjunction)
  cql3: expr: make it possible to prepare conjunctions
  expr_test: add tests for evaluate(conjunction)
  cql3: expr: make it possible to evaluate conjunctions
2022-12-14 09:48:26 +02:00
Botond Dénes
122b267478 Merge 'repair: coroutinize to_repair_rows_list' from Avi Kivity
Simplify a somewhat complicated function.

Closes #12290

* github.com:scylladb/scylladb:
  repair: to_repair_rows_list: reindent
  repair: to_repair_rows_list: coroutinize
2022-12-14 09:39:47 +02:00
Avi Kivity
c09583bcef storage_proxy: coroutinize send_truncate_blocking
Not particularly important, but a small simplification.

Closes #12288
2022-12-14 09:39:33 +02:00
Tomasz Grabiec
132d5d4fa1 messaging: Shutdown on stop() if it wasn't shut down earlier
All rpc::client objects have to be stopped before they are
destroyed. Currently this is done in
messaging_service::shutdown(). The cql_test_env does not call
shutdown() currently. This can lead to use-after-free on the
rpc::client object, manifesting like this:

Segmentation fault on shard 0.
Backtrace:
column_mapping::~column_mapping() at schema.cc:?
db::cql_table_large_data_handler::internal_record_large_cells(sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long) const at ./db/large_data_handler.cc:180
operator() at ./db/large_data_handler.cc:123
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long>(std::__invoke_other, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*&&, column_definition const&, unsigned long&&, unsigned long&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long>, seastar::future<void> >::type std::__invoke_r<seastar::future<void>, db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long>(db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*&&, column_definition const&, unsigned long&&, unsigned long&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:114
 (inlined by) std::_Function_handler<seastar::future<void> (sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long), db::cql_table_large_data_handler::cql_table_large_data_handler(gms::feature_service&, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>, utils::updateable_value<unsigned int>)::$_1>::_M_invoke(std::_Any_data const&, sstables::sstable const&, sstables::key const&, clustering_key_prefix const*&&, column_definition const&, unsigned long&&, unsigned long&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:290
std::function<seastar::future<void> (sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long)>::operator()(sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:591
 (inlined by) db::cql_table_large_data_handler::record_large_cells(sstables::sstable const&, sstables::key const&, clustering_key_prefix const*, column_definition const&, unsigned long, unsigned long) const at ./db/large_data_handler.cc:175
seastar::rpc::log_exception(seastar::rpc::connection&, seastar::log_level, char const*, std::__exception_ptr::exception_ptr) at ./build/release/seastar/./seastar/src/rpc/rpc.cc:109
operator() at ./build/release/seastar/./seastar/src/rpc/rpc.cc:788
operator() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:1682
 (inlined by) void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&) const::{lambda()#1}&&) at ./build/release/seastar/./seastar/include/seastar/core/future.hh:2134
 (inlined by) operator() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:1681
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14>(seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::rpc::client::client(seastar::rpc::logger const&, void*, seastar::rpc::client_options, seastar::socket, seastar::socket_address const&, seastar::socket_address const&)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/future.hh:781
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2319
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2756
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2925
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2808
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:265
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:156
operator() at ./build/release/seastar/./seastar/src/testing/test_runner.cc:75
 (inlined by) void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char**)::$_0>::_M_invoke(std::_Any_data const&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/std_function.h:591
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:73

Fix by making sure that shutdown() is called prior to destruction.

Fixes #12244

Closes #12276
2022-12-14 10:28:26 +03:00
Tzach Livyatan
7cd613fc08 Docs: Improve wording on the os-supported page v2
Closes #11871
2022-12-14 08:59:26 +02:00
Botond Dénes
31fcfe62e1 Merge 'doc: add the description of AzureSnitch to the documentation' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/11712

Updates added with this PR:
- Added a new section with the description of AzureSnitch (similar to others + examples and language improvements).
- Fixed the headings so that they render properly.
- Replaced "Scylla" with "ScyllaDB".

Closes #12254

* github.com:scylladb/scylladb:
  docs: replace Scylla with ScyllaDB on the Snitches page
  docs: fix the headings on the Snitches page
  doc: add the description of AzureSnitch to the documentation
2022-12-14 08:58:48 +02:00
Lubos Kosco
3f9dca9c60 doc: print out the generated UUID for sending to support
Closes #12176
2022-12-14 08:57:54 +02:00
guy9
a329fcd566 Updated University monitoring lesson link
Closes #11906
2022-12-14 08:50:26 +02:00
Jan Ciolek
9afa9f0e50 expr_test: add unit tests for prepare_expression(conjunction)
Add unit tests which ensure that preparing conjunctions
works as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-12-13 20:23:17 +01:00
Jan Ciolek
dde86a2da6 cql3: expr: make it possible to prepare conjunctions
prepare_expression used to throw an error
when encountering a conjunction.

Now it's possible to use prepare_expression
to prepare an expression that contains
conjunctions.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-12-13 20:23:17 +01:00
Jan Ciolek
5f5b1c4701 expr_test: add tests for evaluate(conjunction)
Add unit tests which ensure that evaluating
a conjunction behaves as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-12-13 20:23:17 +01:00
Jan Ciolek
b3c16f6bc8 cql3: expr: make it possible to evaluate conjunctions
Previously it was impossible to use expr::evaluate()
to get the value of a conjunction of elements
separated by ANDs.

Now it has been implemented.

NULL is treated as an "unkown value" - maybe true maybe false.
`TRUE AND NULL` evaluates to NULL because it might be true but also might be false.
`FALSE AND NULL` evaluates to FALSE because no matter what value NULL acts as, the result will still be FALSE.
Unset and empty values are not allowed.

Usually in CQL the rule is that when NULL occurs in an operation the whole expression
becomes NULL, but here we decided to deviate from this behavior.
Treating NULL as an "unkown value" is the standard SQL way of handing NULLs in conjunctions.
It works this way in MySQL and Postgres so we do it this way as well.

The evaluation short-circuits. Once FALSE is encountered the function returns FALSE
immediately without evaluating any further elements.
It works this way in Postgres as well, for example:
`SELECT true AND NULL AND 1/0 = 0` will throw a division by zero error
but `SELECT false AND 1/0 = 0` will successfully evaluate to FALSE.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-12-13 20:23:08 +01:00
Benny Halevy
e9e66f3ca7 database: drop_table_on_all_shards: limit truncated_at time
The infinetely high time_point of `db_clock::time_point::max()`
used in ba42852b0e
is too high for some clients that can't represent
that as a date_time string.

Instead, limit it to 9999-12-31T00:00:00+0000,
that is practically sufficient to ensure truncation of
all sstables and should be within the clients' limits.

Fixes #12239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12273
2022-12-13 16:46:20 +02:00
Avi Kivity
919888fe60 Merge 'docs/dev: Add backport instructions for contributors' from Jan Ciołek
Add instructions on how to backport a feature to on older version of Scylla.

It contains a detailed step-by-step instruction so that people unfamiliar with intricacies of Scylla's repository organization can easily get the hang of it.

This is the guide I wish I had when I had to do my first backport.

I put it in backport.md because that looks like the file responsible for this sort of information.
For a moment I thought about `CONTRIBUTING.md`, but this is a really short file with general information, so it doesn't really fit there. Maybe in the future there will be some sort of unification (see #12126)

Closes #12138

* github.com:scylladb/scylladb:
  dev/docs: add additional git pull to backport docs
  docs/dev: add a note about cherry-picking individual commits
  docs/dev: use 'is merged into' instead of 'becomes'
  docs/dev: mention that new backport instructions are for the contributor
  docs/dev: Add backport instructions for contributors
2022-12-13 16:27:04 +02:00
Pavel Emelyanov
fe4cf231bc snitch: Check http response codes to be OK
Several snitch drivers make http requests to get
region/dc/zone/rack/whatever from the cloud provider. They blindly rely
on the response being successfull and read response body to parse the
data they need from.

That's not nice, add checks for requests finish with http OK statuses.

refs: #12185

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12287
2022-12-13 14:49:18 +02:00
Benny Halevy
68141d0aac topology: get rid of pending state
Now, with a44ca06906,
is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.

With that we can get rid of this state and just keep
all endpoints we know about in the topology.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-13 14:17:18 +02:00
Benny Halevy
f2753eba30 topology: debug log update and remove endpoint
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-13 14:17:13 +02:00
Avi Kivity
c7cee0da40 Merge 'storage_service: handle_state_normal: always update_topology before update_normal_tokens' from Benny Halevy
update_normal_tokens checks that that the endpoint is in topology. Currently we call update_topology on this path only if it's not a normal_token_owner, but there are paths when the endpoint could be a normal token owner but still
be pending in topology so always update it, just in case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12080

* github.com:scylladb/scylladb:
  storage_service: handle_state_normal: always update_topology before update_normal_tokens
  storage_service: handle_state_normal: delete outdated comment regarding update pending ranges race
2022-12-13 13:41:10 +02:00
Avi Kivity
75e469193b Merge 'Use Host ID as Raft ID' from Kamil Braun
Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade).

Fixes: https://github.com/scylladb/scylladb/issues/12204

Closes #12275

* github.com:scylladb/scylladb:
  service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0`
  gms, service: stop gossiping and storing RAFT_SERVER_ID
  Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round"
  service: use HOST_ID instead of RAFT_SERVER_ID during replace
  service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map
  main: use Host ID as Raft ID
2022-12-13 13:39:41 +02:00
Andrii Patsula
cd2e786d72 Report a warning when a server's IP cannot be found in ping.
Fixes #12156
Closes #12206
2022-12-13 11:18:59 +01:00
Botond Dénes
51f867339e Merge 'Docs: cleanup add-node-to-cluster' from Benny Halevy
This series improves the add-node-to-cluster document, in particular around the documentation for the associated cleanup procedure, and the prerequisite steps.

It also removes information about outdated releases.

Closes #12210

* github.com:scylladb/scylladb:
  docs: operating-scylla: add-node-to-cluster: deleted instructions for unsupported releases
  docs: operating-scylla: add-node-to-cluster: cleanup: move tips to a note
  docs: operating-scylla: add-node-to-cluster: improve wording of cleanup instructions
  docs: operating-scylla: prerequisites: system_auth is a keyspace, not a table
  docs: operating-scylla: prerequisites: no Authetication status is gathered
  docs: operating-scylla: prerequisites: simplify grep commands
  docs: operating-scylla: add-node-to-cluster: prerequisites: number sub-sections
  docs: operating-scylla: add-node-to-cluster: describe other nodes in plural
2022-12-13 10:54:05 +02:00
Botond Dénes
4122854ae7 Merge 'repair: coroutinize repair_range' from Avi Kivity
Nicer and simpler, but essentially cosmetic.

Closes #12235

* github.com:scylladb/scylladb:
  repair: reindent repair_range
  repair: coroutinize repair_range
2022-12-13 08:16:05 +02:00
Avi Kivity
96890d4120 repair: to_repair_rows_list: reindent 2022-12-12 22:54:07 +02:00
Avi Kivity
e482cb1764 repair: to_repair_rows_list: coroutinize
Simplifying a complicated function. It will also be a
little faster due to fewer allocations, but not significantly.
2022-12-12 22:52:12 +02:00
Avi Kivity
c728de8533 sstables: update_info_for_opened_data: reindent
Recover much-needed indent levels for future use.
2022-12-12 22:38:07 +02:00
Avi Kivity
eace9a226c sstables: update_info_for_opened_data: coroutinize
Nothing special, just simplifying a complicated function.
2022-12-12 22:35:46 +02:00
Michał Jadwiszczak
5985f22841 version: Reverse version increase
Revert version change made by PR #11106, which increased it to `4.0.0`
to enable server-side describe on latest cqlsh.

Turns out that our tooling some way depends on it (eg. `sstableloader`)
and it breaks dtests.
Reverting only the version allows to leave the describe code unchanged
and it fixes the dtests.

cqlsh 6.0.0 will return a warning when running `DESC ...` commands.

Closes #12272
2022-12-12 18:45:32 +02:00
Kamil Braun
a26f62b37b service/raft: raft_group0: take raft::server_id parameter in remove_from_group0
We no longer need to translate from IP to Raft ID using the address map,
because Raft ID is now equal to the Host ID - which is always available
at the call site of `remove_from_group0`.
2022-12-12 15:23:05 +01:00
Kamil Braun
bf6679906f gms, service: stop gossiping and storing RAFT_SERVER_ID
It is equal to (if present) HOST_ID and no longer used for anything.

The application state was only gossiped if `experimental-features`
contained `raft`, so we can free this slot.

Similarly, `raft_server_id`s were only persisted in `system.peers` if
the `SUPPORTS_RAFT` cluster feature was enabled, which happened only
when `experimental-features` contained `raft`. The `raft_server_id`
field in the schema was also introduced recently in `master` and didn't
get to be in a release yet. Given either of these reasons, we can remove
this field safely.
2022-12-12 15:20:30 +01:00
Kamil Braun
5dbe236339 Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round"
This reverts commit 60217d7f50.
We no longer need RAFT_SERVER_ID.
2022-12-12 15:20:20 +01:00
Kamil Braun
3e58da0719 service: use HOST_ID instead of RAFT_SERVER_ID during replace
Makes the code simpler because we can assume that HOST_ID is always
there.
2022-12-12 15:18:56 +01:00
Kamil Braun
32c56920b4 service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map
With the earlier commit, if gossiped RAFT_SERVER_ID is not empty then
it's the same as HOST_ID.
2022-12-12 15:16:56 +01:00
Calle Wilund
e99626dc10 config: Change wording of "none" in encryption options to maybe reduce user confusion
Fixes /scylladb/scylla-enterprise/issues#1262

Changes the somewhat ambiguous "none" into "not set" to clarify that "none" is not an
option to be written out, but an absense of a choice (in which case you also have made
a choice).

Closes #12270
2022-12-12 16:14:53 +02:00
Kamil Braun
f3243ff674 main: use Host ID as Raft ID
The Host ID now uniquely identifies a node (we no longer steal it during
node replace) and Raft is still experimental. We can reuse the Host ID
of a node as its Raft ID. This will allow us to remove and simplify a
lot of code.

With this we can already remove some dead code in this commit.
2022-12-12 15:14:51 +01:00
Botond Dénes
d44c5f5548 scripts: add open-coredump.sh
Script for "one-click" opening of coredumps.
It extracts the build-id from the coredump, retrieves metadata for that
build, downloads the binary package, the source code and finally
launches the dbuild container, with everything ready to load the
coredump.
The script is idempotent: running it after the prepartory steps will
re-use what is already donwloaded.

The script is not trying to provide a debugging environment that caters
to all the different ways and preferences of debugging. Instead, it just
sets up a minimalistic environment for debugging, while providing
opportunities for the user to customization according to their
preferred.

I'm not entirely sure, coredumps from master branch will work, but we
can address this later when we confirm they don't.

Example:

    $ ~/ScyllaDB/scylla/worktree0/scripts/open-coredump.sh ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000
    Build id: 5009658b834aaf68970135bfc84f964b66ea4dee
    Matching build is scylla-5.0.5 0.20221009.5a97a1060 release-x86_64
    Downloading relocatable package from http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
    Extracting package scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
    Cloning scylla.git
    Downloading scylla-gdb.py
    Copying scylla-gdb.py from /home/bdenes/ScyllaDB/storage/11961/open-coredump.sh.dir/scylla.repo
    Launching dbuild container.

    To examine the coredump with gdb:

        $ gdb -x scylla-gdb.py -ex 'set directories /src/scylla' --core ./core.scylla.113.bac3650b616f4f09a4d1ab160574b6a5.4349.1669185225000000000000 /opt/scylladb/libexec/scylla

    See https://github.com/scylladb/scylladb/blob/master/docs/dev/debugging.md for more information on how to debug scylla.

    Good luck!
    [root@fedora workdir]#

Closes #12223
2022-12-12 12:55:28 +02:00
Kamil Braun
dcba652013 Merge 'replacenode: do not inherit host_id' from Benny Halevy
We want to always be able to distinguish between
the replacing node and the replacee by using different,
unique, host identifiers.

This will allow us to use the host_id authoritatively
to identify the node (rather then its endpoint ip address)
for token mapping and node operations.

Also, it will be used in the following patch to never allow the
replaced node to rejoin the cluster, as its host_id should never
be reused.

This change does not affect #5523, the replaced node may still steal back its tokens if restarted.

Refs #9839
Refs #12040

Closes #12250

* github.com:scylladb/scylladb:
  docs: replace-dead-node: update host_id of replacing node
  docs: replace-dead-node: fix alignment
  db: system_keyspace: change set_local_host_id to private set_local_random_host_id
  storage_service: do not inherit the host_id of a replaced a node
2022-12-12 11:00:42 +01:00
Benny Halevy
c6f05b30e1 task_manager: task: impl: add virtual destructor
The generic task holds and destroyes a task::impl
but we want the derived class's destructor to be called
when the task is destroyed otherwise, for example,
member like abort_source subscription will not be destroyed
(and auto-unlinked).

Fixes #12183

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12266
2022-12-11 22:10:59 +02:00
Benny Halevy
36a9f62833 repair: repair_module: use mutable capture for func
It is moved into the async thread so the encapsulating
function should be defined mutable to move the func
rather thna copying it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12267
2022-12-11 22:10:28 +02:00
Nadav Har'El
0c26032e70 test/cql-pytest: translate more Cassandra tests
This patch includes a translation of two more test files from
Cassandra's CQL unit test directory cql3/validation/operations.

All tests included here pass on Cassandra. Several test fail on Scylla
and are marked "xfail". These failures discovered two previously-unknown
bugs:

    #12243: Setting USING TTL of "null" should be allowed
    #12247: Better error reporting for oversized keys during INSERT

And also added reproducers for two previously-known bugs:

    #3882: Support "ALTER TABLE DROP COMPACT STORAGE"
    #6447: TTL unexpected behavior when setting to 0 on a table with
           default_time_to_live

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12248
2022-12-11 21:42:57 +02:00
Nadav Har'El
09a3c63345 cross-tree: allow std::source_location in clang 14
We recently (commit 6a5d9ff261) started
to use std::source_location instead of std::experimental::source_location.
However, this does not work on clang 14, because libc++ 12's
<source_location> only works if __builtin_source_location, and that is
not available on clang 14.

clang 15 is just three months old, and several relatively-recent
distributions still carry clang 14 so it would be nice to support it
as well.

So this patch adds a trivial compatibility header file, which, when
included and compiled with clang 14, it aliases the functional
std::experimental::source_location to std::source_location.

It turns out it's enough to include the new header file from three
headers that included <source_location> -  I guess all other uses
of source_location depend on those header files directly or indirectly.
We may later need to include the compatibility header file in additional
places, bug for now we don't.

Refs #12259

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12265
2022-12-11 20:28:49 +02:00
Avi Kivity
e6ffc22053 Merge 'cql3: Server-side DESC statement' from Michał Jadwiszczak
This PR adds server-side `DESCRIBE` statement, which is required in latest cqlsh version.

The only change from the user perspective is the `DESC ...` statement can be used with cqlsh version >= 6.0. Previously the statement was executed from client side, but starting with Cassandra 4.0 and cqlsh 6.0, execution of describe was moved to server side, so the user was unable to do `DESC ...` with Scylla and cqlsh 6.0.

Implemented describe statements:
- `DESC CLUSTER`
- `DESC [FULL] SCHEMA`
- `DESC [ONLY] KEYSPACE`
- `DESC KEYSPACES/TYPES/FUNCTIONS/AGGREGATES/TABLES`
- `DESC TYPE/FUNCTION/AGGREGATE/MATERIALIZED VIEW/INDEX/TABLE`
- `DESC`

[Cassandra's implementation for reference](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/DescribeStatement.java)

Changes in this patch:
- cql3::util: added `single_quite()` function
- added `data_dictionary::keyspace_element` interface
- implemented `data_dictionary::keyspace_element` for:
    - keyspace_metadata,
    - UDT, UDF, UDA
    - schema
- cql3::functions: added `get_user_functions()` and `get_user_aggregates()` to get all UDFs/UDAs in specified keyspace
- data_dictionary::user_types_metadata: added `has_type()` function
- extracted `describe_ring()` from storage_service to standalone helper function in `locator/util.hh`
- storage_proxy: added `describe_ring()` (implemented using helper function mentioned above)
- extended CQL grammar to handle describe statement
- increased version in `version.hh` to 4.0.0, so cqlsh will use server-side describe statement

Referring: https://github.com/scylladb/scylla/issues/9571, https://github.com/scylladb/scylladb/issues/11475

Closes #11106

* github.com:scylladb/scylladb:
  version: Increasing version
  cql-pytest: Add tests for server-side describe statement
  cql-pytest: creating random elements for describe's tests
  cql3: Extend CQL grammar with server-side describe statement
  cql3:statements: server-side describe statement
  data_dictonary: add `get_all_keyspaces()` and `get_user_keyspaces()`
  storage_proxy: add `describe_ring()` method
  storage_service, locator: extract describe_ring()
  data_dictionary:user_types_metadata: add has_type() function
  cql3:functions: `get_user_functions()` and `get_user_aggregates()`
  implement `keyspace_element` interface
  data_dictionary: add `keyspace_element` interface
  cql3: single_quote() util function
  view: row_lock: lock_ck: reindent
  test/topology: enable replace tests
  service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
  service: handle replace correctly with Raft enabled
  gms/gossiper: fetch RAFT_SERVER_ID during shadow round
  service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
2022-12-11 18:29:36 +02:00
Michał Jadwiszczak
8d88c9721e version: Increasing version
The `current()` version in version.hh has to be increased to at
least 4.0.0, so server-side describe will be used. Otherwise,
cqlsh returns warning that client-side describe is not supported.
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
3ddde7c5ad cql-pytest: Add tests for server-side describe statement 2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
f91d05df43 cql-pytest: creating random elements for describe's tests
Add helper functions to create random elements (keyspaces, tables, types)
to increase the coverage of describe statment's tests.

This commit also adds `random_seed` fixture. The fixture should be
always used when using random functions. In case of test's failure, the
seed will be present in test's signature and the case can be easili
recreated.
After the test finishes, the fixture restores state of `random` to
before-test state.
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
c563b2133c cql3: Extend CQL grammar with server-side describe statement 2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
e572d5f111 cql3:statements: server-side describe statement
Starting from cqlsh 6.0.0, execution of the describe statement was moved
from the client to the server.

This patch implements server-side describe statement. It's done by
simply fetching all needed keyspace elements (keyspace/table/index/view/UDT/UDF/UDA)
and generating the desired description or list of names of all elements.
The description of any element has to respect CQL restrictions(like
name's quoting) to allow quickly recreate the schema by simply copy-pasting the descritpion.
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
673393d88a data_dictonary: add get_all_keyspaces() and get_user_keyspaces()
Adds functions to `data_dictionary::database` in order to obtain names
of all keyspaces/all user keyspaces.
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
360dbf98f1 storage_proxy: add describe_ring() method
In order to execute `DESC CLUSTER`, there has to be a way to describe
ring. `storage_service` is not available at query execution. This patch
adds `describe_ring()` as a method of `storage_proxy()` (using helper
function from `locator/util.hh`).
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
dd46a92e23 storage_service, locator: extract describe_ring()
`describe_ring()` was implemented as a method of `storage_service`. This
patch extracts it from there to a standalone helper function in
`locator/util.hh`.
2022-12-10 12:51:05 +01:00
Michał Jadwiszczak
51a02e3bd7 data_dictionary:user_types_metadata: add has_type() function
Adds `has_type()` function to `user_types_metadata`. The functions
determins whether UDT with given name exists.
2022-12-10 12:50:52 +01:00
Michał Jadwiszczak
06cd03d3cd cql3:functions: get_user_functions() and get_user_aggregates()
Helper functions to obtain UDFs/UDAs for certain keyspace.
2022-12-10 12:36:59 +01:00
Michał Jadwiszczak
29ad5a08a8 implement keyspace_element interface
This patch implements `data_dictionary::keyspace_element` interfece
in: `keyspace_metadata`, `user_type_impl`, `user_function`,
`user_aggregate` and schema.
2022-12-10 12:34:09 +01:00
Michał Jadwiszczak
f30378819d data_dictionary: add keyspace_element interface
A common interace for all keyspace elements, which are:
keyspace, UDT, UDF, UDA, tables, views, indexes.
The interface is to have a unified way to describe those elements.
2022-12-10 12:27:38 +01:00
Michał Jadwiszczak
0589116991 cql3: single_quote() util function
`single_quote()` takes a string and transforms it to a string
which can be safely used in CQL commands.
Single quoting involves wrapping the name in single-quotes ('). A sigle-quote
character itself is quoted by doubling it.
Single quoting is necessary for dates, IP addresses or string literals.
2022-12-10 12:27:22 +01:00
Benny Halevy
9c2a5a755f view: row_lock: lock_ck: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-10 12:27:22 +01:00
Kamil Braun
c43e64946a test/topology: enable replace tests
Also add some TODOs for enhancing existing tests.
2022-12-10 12:27:22 +01:00
Kamil Braun
b01cba8206 service/raft: report an error when Raft ID can't be found in raft_group0::remove_from_group0
Also simplify the code and improve logging in general.

The previous code did this: search for the ID in the address map. If it
couldn't be found, perform a read barrier and search again. If it again
couldn't be found, return.

This algorithm depended on the fact that IP addresses were stored in
group 0 configuration. The read barrier was used to obtain the most
recent configuration, and if the IP was not a part of address map after
the read barrier, that meant it's simply not a member of group 0.

This logic no longer applies so we can simplify the code.

Furthermore, when I was fixing the replace operation with Raft enabled,
at some point I had a "working" solution with all tests passing. But I
was suspicious and checked if the replaced node got removed from
group 0. It wasn't. So the replace finished "successfully", but we had
an additional (voting!) member of group 0 which didn't correspond to
a token ring member.

The last version of my fixes ensure that the node gets removed by the
replacing node. But the system is fragile and nothing prevents us from
breaking this again. At least log an error for now. Regression tests
will be added later.
2022-12-10 12:27:22 +01:00
Kamil Braun
c65f4ae875 service: handle replace correctly with Raft enabled
We must place the Raft ID obtained during the shadow round in the
address map. It won't be placed by the regular gossiping route if we're
replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it
is not guaranteed that background gossiping manages to update the
address map before we need it, especially in tests where we set
ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round,
on the other hand, performs a synchronous request (and if it fails
during bootstrap, bootstrap will fail - because we also won't be able to
obtain the tokens and Host ID of the replaced node).

Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
2022-12-10 12:27:22 +01:00
Kamil Braun
60217d7f50 gms/gossiper: fetch RAFT_SERVER_ID during shadow round
During the replace operation we need the Raft ID of the replaced node.
The shadow round is used for fetching all necessary information before
the replace operation starts.
2022-12-10 12:27:22 +01:00
Kamil Braun
b424cc40fa service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
Most of the sleeps related to gossiping are based on `ring_delay`,
which is configurable and can be set to lower value e.g. during tests.

But for some reason there was one case where we slept for a hardcoded
value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds.

Use `2 * get_ring_delay()` instead. With the default value of
`ring_delay` (30 seconds) this will give the same behavior.
2022-12-10 12:27:22 +01:00
Anna Stuchlik
8d1050e834 docs: replace Scylla with ScyllaDB on the Snitches page 2022-12-09 13:34:18 +01:00
Anna Stuchlik
5cb191d5b0 docs: fix the headings on the Snitches page 2022-12-09 13:26:36 +01:00
Anna Stuchlik
a699904374 doc: add the description of AzureSnitch to the documentation 2022-12-09 13:22:01 +01:00
Nadav Har'El
e47794ed98 test/cql-pytest: regression test for index scan with start token
When we have a table with partition key p and an indexed regular column
v, the test included in this patch checks the query

     SELECT p FROM table WHERE v = 1 AND TOKEN(p) > 17

This can work and not require ALLOW FILTERING, because the secondary index
posting-list of "v=1" is ordered in p's token order (to allow SELECT with
and without an index to return the same order - this is explained in
issue #7443). So this test should pass, and indeed it does on both current
Scylla, and Cassandra.

However, it turns out that this was a bug - issue #7043 - in older
versions of Scylla, and only fixed in Scylla 4.6. In older versions,
the SELECT wasn't accepted, claiming it requires ALLOW FILTERING,
and if ALLOW FILTERING was added, the TOKEN(p) > 17 part was silently
ignored.

The fix for issue #7043 actually included regression tests, C++ tests in
test/boost/secondary_index_test.cc. But in this patch we also add a Python
test in test/cql-pytest.

One of the benefits of cql-pytest is that we can (and I did) run the same
test on Cassandra to verify we're not implementing a wrong feature.
Another benefit is that we can run a new test on an old version, and
not even require re-compilation: You can run this new test on any
existing installation of Scylla to check if it still has issue #7043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12237
2022-12-09 09:33:16 +02:00
Benny Halevy
018dedcc0c docs: replace-dead-node: update host_id of replacing node
The replacing node no longer assumes the host_id
of the replacee.  It will continue to use a random,
unique host_id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-09 08:23:31 +02:00
Benny Halevy
37d75e5a21 docs: replace-dead-node: fix alignment 2022-12-09 08:23:31 +02:00
Benny Halevy
89920d47d6 db: system_keyspace: change set_local_host_id to private set_local_random_host_id
Now that the local host_id is never changed externally
(by the storage_service upon replace-node),
the method can be made private and be used only for initializing the
local host_id to a random one.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-09 08:23:31 +02:00
Benny Halevy
9942c60d93 storage_service: do not inherit the host_id of a replaced a node
We want to always be able to distinguish between
the replacing node and the replacee by using different,
unique, host identifiers.

This will allow us to use the host_id authoritatively
to identify the node (rather then its endpoint ip address)
for token mapping and node operations.

Also, it will be used in the following patch to never allow the
replaced node to rejoin the cluster, as its host_id should never
be reused.

Refs #9839
Refs #12040

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-09 08:23:31 +02:00
Pavel Emelyanov
7197757750 broadcast_tables: Forward-declare storage_proxy in lang.hh
Currently the header includes storage_proxy.hh and spreads this over the
code via raft_group0_client.hh -> group0_state_machine.hh -> lang.hh

Forward declaring proxy class it eliminates ~100 indirect dependencies on
storage_proxy.hh via this chain.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12241
2022-12-09 01:23:51 +02:00
Pavel Emelyanov
6075e01312 test/lib: Remove sstable_utils.hh from simple_schema.hh
The latter is pretty popular test/lib header that disseminates the
former one over whole lot of unit tests. The former, in turn, naturally
includes sstables.hh thus making tons of unrelated tests depend on
sstables class unused by them.

However, simple removal doesn't work, becase of local_shard_only bool
class definition in sstable_utils.hh used in simple_schema.hh. This
thing, in turn, is used in keys making helpers that don't belong to
sstable utils, so these are moved into simple_schema as well.

When done, this affects the mutation_source_test.hh, which needs the
local_shard_only bool class (and helps spreading the sstables.hh
throughout more unrelated tests) and a bunch of .cc test sources that
used sstable_utils.hh to indirectly include various headers of their
demand.

After patching, sstables.hh touches 2x times less tests. As a side
effect the sstables_manager.hh also becomes 2x times less dependent
on by tests.

Continuation of 9bdea110a6

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12240
2022-12-08 15:37:33 +02:00
Tomasz Grabiec
4e7ddb6309 position_in_partition: Introduce before_key(position_in_partition_view) 2022-12-08 13:41:28 +01:00
Tomasz Grabiec
536c0ab194 db: Fix trim_clustering_row_ranges_to() for non-full keys and reverse order
trim_clustering_row_ranges_to() is broken for non-full keys in reverse
mode. It will trim the range to
position_in_partition_view::after_key(full_key) instead of
position_in_partition_view::before_key(key), hence it will include the
key in the resulting range rather than exclude it.

Fixes #12180
Refs #1446
2022-12-08 13:41:28 +01:00
Tomasz Grabiec
232ce699ab types: Fix comparison of frozen sets with empty values
A frozen set can be part of the clustering key, and with compact
storage, the corresponding key component can have an empty value.

Comparison was not prepared for this, the iterator attempts to
deserialize the item count and will fail if the value is empty.

Fixes #12242
2022-12-08 13:41:11 +01:00
Nadav Har'El
4cdaba778d Merge 'Secondary indexes on static columns' from Piotr Dulikowski
This pull request introduces support for global secondary indexes based on static columns.

Local secondary indexes based on secondary columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require full partition key when querying, such indexes wouldn't be very useful and would only waste resources.

The index table for secondary indexes on static columns, unlike other secondary indexes, do not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would only be unnecessary.

The already existing logic for querying using secondary indexes works after introducing minimal notifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed to keep most of the logic intact.

New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed.

Materialized view with static rows as a key are __not__ implemented in this PR.

Fixes: #2963

Closes #11166

* github.com:scylladb/scylladb:
  test_materialized_view: verify that static columns are not allowed
  test_secondary_index: add (currently failing) test for static index paging
  test_secondary_index: add more tests for secondary indexes on static columns
  cassandra_tests: enable existing tests for static columns
  create_index_statement: lift restriction on secondary indexes on static rows
  db/view: fetch and process static rows when building indexes
  gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
  create_index_statement: disallow creation of local indexes with static columns
  select_statement: prepare paging for indexes on static columns
  select_statement: do not attempt to fetch clustering columns from secondary index's table
  secondary_index_manager: don't add clustering key columns to index table of static column index
  replica/table: adjust the view read-before-write to return static rows when needed
  db/view: process static rows in view_update_builder::on_results
  db/view: adjust existing view update generation path to use clustering_or_static_row
  column_computation: adjust to use clustering_or_static_row
  db/view: add clustering_or_static_row
  deletable_row: add column_kind parameter to is_live
  view_info: adjust view_column to accept column_kind
  db/view: base_dependent_view_info: split non-pk columns into regular and static
2022-12-08 09:54:05 +02:00
Konstantin Osipov
02c30ab5d6 build: fix link error (abseil) on ubuntu toolchain with clang 15
abseil::hash depends on abseil::city and declareds CityHash32
as an external symbol. The city library static library, however,
precedes hash in the link list, which apparently makes the linker
simply drop it from the object list, since its symbols are not
used elsewhere.

Fix the linker ordering to help linker see that CityHash32
is used.

Closes #12231
2022-12-08 09:47:16 +02:00
Avi Kivity
d6457778f1 Merge 'Coroutinize some table functions in preparation to static compaction groups' from Raphael "Raph" Carvalho
Extracted from https://github.com/scylladb/scylladb/pull/12139

Closes #12236

* github.com:scylladb/scylladb:
  replica: table: Fix indentation
  replica: coroutinize table::discard_sstables()
  replica: Coroutinize table::flush()
2022-12-08 09:29:58 +02:00
Piotr Dulikowski
4883e43677 test_materialized_view: verify that static columns are not allowed
Adds a test which verifies that static columns are not allowed in
materialized views. Although we added support for static columns in
secondary indexes, which share a lot of code with materialized views,
static columns in materialized views are not yet ready to use.
2022-12-08 07:41:33 +01:00
Piotr Dulikowski
f864944dcb test_secondary_index: add (currently failing) test for static index paging
Currently, when executing queries accelerated by an index on a static
column, paging is unable to break base table partitions across pages and
is forced to return them in whole. This will cause problems if such a
query must return a very large base table partition because it will have
to be loaded into memory.

Fixing this issue will require a more sophisticated approach than what
was done in the PR. For the time being, an xfailing pytest is added
which should start passing after paging is improved.
2022-12-08 07:41:33 +01:00
Piotr Dulikowski
4f836115fd test_secondary_index: add more tests for secondary indexes on static columns
Adds cql-pytests which test the secondary index on static columns
feature.
2022-12-08 07:41:32 +01:00
Botond Dénes
897b501ba3 Merge 'doc: update the 5.1 upgrade guide with the mode-related information' from Anna Stuchlik
This PR adds the link to the KB article about updating the mode after the upgrade to the 5.1 upgrade guide.
In addition, I have:
- updated the KB article to include the versions affected by that change.
- fixed the broken link to the page about metric updates (it is not related to the KB article, but I fixed it in the same PR to limit the number of PRs that need to be backported).

Related: https://github.com/scylladb/scylladb/pull/11122

Closes #12148

* github.com:scylladb/scylladb:
  doc: update the releases in the KB about updating the mode after upgrade
  doc: fix the broken link in the 5.1 upgrade guide
  doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide
2022-12-08 07:32:10 +02:00
Tomasz Grabiec
992a73a861 row_cache: Destroy coroutine under region's allocator
The reason is alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside coroutine object stored in the update
variable in row_cache::do_update()

It is allocated under cache region, but in case of exception it will
be destroyed under the standard allocator. If update is successful, it
will be cleared under region allocator, so there is not problem in the
normal case.

Fixes #12068

Closes #12233
2022-12-07 21:44:21 +02:00
Raphael S. Carvalho
9ae0d8ba28 replica: table: Fix indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-07 15:53:22 -03:00
Raphael S. Carvalho
b9a33d5a91 replica: coroutinize table::discard_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-07 15:52:36 -03:00
Raphael S. Carvalho
192b64a5ac replica: Coroutinize table::flush()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-07 15:52:27 -03:00
Benny Halevy
a076ceef97 view: row_lock: lock_ck: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 19:27:30 +02:00
Avi Kivity
909fbfdd2f repair: reindent repair_range 2022-12-07 18:17:21 +02:00
Avi Kivity
796ec5996f repair: coroutinize repair_range 2022-12-07 18:13:10 +02:00
Benny Halevy
78c5961114 docs: operating-scylla: add-node-to-cluster: deleted instructions for unsupported releases
2.3 and 2018.1 ended their life and are long gone.
No need to have instructions for them in the master version of this
document.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:07:35 +02:00
Benny Halevy
adeb03e60f docs: operating-scylla: add-node-to-cluster: cleanup: move tips to a note
And be more verbose about why the tips are recommended and their
ramifications.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:07:18 +02:00
Benny Halevy
6e324137bd docs: operating-scylla: add-node-to-cluster: improve wording of cleanup instructions
"use `nodetool cleanup` cleanup command" repeats words, change to
"run the `nodetool cleanup` command".

Also, improve the description of the cleanup action
and how it relate to the bootstrapping process.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:07:08 +02:00
Benny Halevy
eeed330647 docs: operating-scylla: prerequisites: system_auth is a keyspace, not a table
Fix the phrase referring to it as a table respectively.
Also, do some minor phrasing touch-ups in this area.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:06:54 +02:00
Benny Halevy
5d840d4232 docs: operating-scylla: prerequisites: no Authetication status is gathered
Authetication status isn't gathered from scylla.yaml,
only the authenticator, so change the caption respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:06:48 +02:00
Benny Halevy
9cb7056d3e docs: operating-scylla: prerequisites: simplify grep commands
Writing `cat X | grep Y` is both inefficient and somewhat
unprofessional.  The grep command works very well on a file argument
so `grep Y X` will do the job perfectly without the need for a pipe.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:06:36 +02:00
Benny Halevy
71bc12eecc docs: operating-scylla: add-node-to-cluster: prerequisites: number sub-sections
To improve their readability.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:06:35 +02:00
Benny Halevy
16db7bea82 docs: operating-scylla: add-node-to-cluster: describe other nodes in plural
Typically data will be streamed from multiple existing nodes
to the new node, not from a single one.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-12-07 17:03:23 +02:00
Tomasz Grabiec
a46b2e4e4c Merge 'Make node replace procedure work with Raft' from Kamil Braun
We need to obtain the Raft ID of the replaced node during the shadow round and
place it in the address map. It won't be placed by the regular gossiping route
if we're replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it is not
guaranteed that background gossiping manages update the address map before we
need it, especially in tests where we set ring_delay to 0 and disable
wait_for_gossip_to_settle. The shadow round, on the other hand, performs a
synchronous request (and if it fails during bootstrap, bootstrap will fail -
because we also won't be able to obtain the tokens and Host ID of the replaced
node).

Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.

Also remove an unconditional 60 seconds sleep from the replace code. Make it
dependent on ring_delay.

Enable the replace tests.

Modify some code related to removing servers from group 0 which depended on
storing IP addresses in the group 0 configuration.

Closes #12172

* github.com:scylladb/scylladb:
  test/topology: enable replace tests
  service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
  service: handle replace correctly with Raft enabled
  gms/gossiper: fetch RAFT_SERVER_ID during shadow round
  service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
2022-12-07 15:30:27 +01:00
Pavel Emelyanov
9bdea110a6 code: Reduce fanout of sstables(_manager)?.hh over headers
This change removes sstables.hh from some other headers replacing it
with version.hh and shared_sstable.hh. Also this drops
sstables_manager.hh from some more headers, because this header
propagates sstables.hh via self. That change is pretty straightforward,
but has a recochet in database.hh that needs disk-error-handler.hh.

Without the patch touch sstables/sstable.hh results in 409 targets
recompillation, with the patch -- 299 targets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12222
2022-12-07 14:34:19 +02:00
Botond Dénes
57a4971962 Merge 'dirty_memory_manager: tidy up' from Avi Kivity
Tidy up namespaces, move code to the right file, and
move the whole thing to the replica module where it
belongs.

Closes #12219

* github.com:scylladb/scylladb:
  dirty_memory_manager: move implementaton from database.cc
  dirty_memory_manager: move to replica module
  test: dirty_memory_manager_test: disambiguate classes named 'test_region_group'
  dirty_memory_manager: stop using using namespace
2022-12-07 14:25:59 +02:00
Avi Kivity
f7f5700289 dirty_memory_manager: move implementaton from database.cc
A few leftover method implementations were left in database.cc
when dirty_memory_manager.cc was created, move them to their
correct place now.
2022-12-06 22:28:54 +02:00
Avi Kivity
444de2831e dirty_memory_manager: move to replica module
It's a replica-side thing, so move it there. The related
flush_permit and sstable_write_permit are moved alongside.
2022-12-06 22:24:17 +02:00
Avi Kivity
a038a35ad6 test: dirty_memory_manager_test: disambiguate classes named 'test_region_group'
There are two similarly named classes: ::test_region_group and
dirty_memory_manager_logalloc::test_region_group. Rename the
former to ::raii_region_group (that's what it's for) and the
latter to ::test_region_group, to reduce confusion.
2022-12-06 22:20:38 +02:00
Avi Kivity
dfdae5ffa9 dirty_memory_manager: stop using using namespace
`using namespace` is pretty bad, especially in a header, as it
pollutes the namespace for everyone. Stop using it and qualify
names instead.
2022-12-06 21:37:38 +02:00
Avi Kivity
47a8fad2a2 Merge 'scylla-types: add serialize action' from Botond Dénes
Serializes the value that is an instance of a type. The opposite of `deserialize` (previously known as `print`).
All other actions operate on serialized values, yet up to now we were missing a way to go from human readable values to serialized ones. This prevented for example using `scylla types tokenof $pk` if one only had the human readable key value.
Example:

```
$ scylla types serialize -t Int32Type -- -1286905132
b34b62d4
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b 16
0010d00819896f6b11ea00000000001c571b000400000010
$ scylla types serialize --prefix-compound -t TimeUUIDType -t Int32Type -- d0081989-6f6b-11ea-0000-0000001c571b
0010d00819896f6b11ea00000000001c571b
```

Closes #12029

* github.com:scylladb/scylladb:
  docs: scylla-types.rst: add mention of per-operation --help
  tools/scylla-types: add serialize operation
  tools/scylla-types: prepare for action handlers with string arguments
  tools/scylla-types: s/print/deserialize/ operation
  docs: scylla-types.rst: document tokenof and shardof
  docs: scylla-types.rst: fix typo in compare operation description
2022-12-06 19:27:15 +02:00
Nadav Har'El
f275bfd57b Update CODEOWNERS file
Update the CODEOWNERS file with some people who joined different parts
of the project, and one person that left.

Note that despite is name, CODEOWNERS does not list "ownership" in any
strict sense of the word - it is more about who is willing and/or
knowledgeable enough to participate in reviewing changes to particular
files or directories. Github uses this file to automatically suggest
who should review a pull request.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12216
2022-12-06 19:26:03 +02:00
Benny Halevy
5007ded2c1 view: row_lock: lock_ck: serialize partition and row locking
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:

    1. lock_pk acquires an exclusive lock on the partition.
    2.a lock_ck attempts to acquire shared lock on the partition
        and any lock on the row. both cases currently use a fiber
        returning a future<rwlock::holder>.
    2.b since the partition is locked, the lock_partition times out
        returning an exceptional future.  lock_row has no such problem
        and succeeds, returning a future holding a rwlock::holder,
        pointing to the row lock.
    3.a the lock_holder previously returned by lock_pk is destroyed,
        calling `row_locker::unlock`
    3.b row_locker::unlock sees that the partition is not locked
        and erases it, including the row locks it contains.
    4.a when_all_succeeds continuation in lock_ck runs.  Since
        the lock_partition future failed, it destroyes both futures.
    4.b the lock_row future is destroyed with the rwlock::holder value.
    4.c ~holder attempts to return the semaphore units to the row rwlock,
        but the latter was already destroyed in 3.b above.

Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above,

This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.

This way, erasing the unlocked partition is never expected
to happen while any of its rows locks is held.

Fixes #12168

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12208
2022-12-06 16:29:46 +02:00
Botond Dénes
f017e9f1c6 docs: document the reader concurrency semaphore diagnostics dump
The diagnostics dumped by the reader concurrency semaphore are pretty
common-sight in logs, as soon as a node becomes problematic. The reason
is that the reader concurrency semaphore acts as the canary in the coal
mine: it is the first that starts screaming when the node or workload is
unhealthy. This patch adds documentation of the content of the
diagnostics and how to diagnose common problems based on it.

Fixes: #10471

Closes #11970
2022-12-06 16:24:44 +02:00
Botond Dénes
c35cee7e2b docs: scylla-types.rst: add mention of per-operation --help 2022-12-06 14:47:28 +02:00
Botond Dénes
4f9799ce4f tools/scylla-types: add serialize operation
Takes human readable values and converts them to serialized hex encoded
format. Only regular atomic types are supported for now, no
collection/UDT/tuple support, not even in frozen form.
2022-12-06 14:46:53 +02:00
Botond Dénes
7c87655b4b tools/scylla-types: prepare for action handlers with string arguments
Currently all action handlers have bytes arguments, parsed from
hexadecimal string representations. We plan on adding a serialize
command which will require raw string arguments. Prepare the
infrastructure for supporting both types of action handlers.
2022-12-06 14:45:30 +02:00
Botond Dénes
15452730fb tools/scylla-types: s/print/deserialize/ operation
Soon we will have a serialize operation. Rename the current print
operation to deserialize in preparation to that. We want the two
operations (serialize and deserialize) to reflect their relation in
their names too.
2022-12-06 14:45:30 +02:00
Botond Dénes
f98e6552b4 docs: scylla-types.rst: document tokenof and shardof
These new actions were added recently but without the accompanying
documentation change. Make up for this now.
2022-12-06 14:45:30 +02:00
Botond Dénes
30c047cae6 docs: scylla-types.rst: fix typo in compare operation description 2022-12-06 14:45:23 +02:00
Piotr Dulikowski
680423ad9d cassandra_tests: enable existing tests for static columns
Removes the "xfail" marker from the now-passing tests related to
secondary indexes on static columns.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
cc3af3190d create_index_statement: lift restriction on secondary indexes on static rows
Secondary indexes on static columns should work now. This commit lifts
the existing restriction after the cluster is fully upgraded to a
version which supports such indexes.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
86dad30b66 db/view: fetch and process static rows when building indexes
This commit modifies the view builder and its consumer so that static
rows are always fetched and properly processed during view build.

Currently, the view builder will always fetch both static and clustering
rows, regardless of the type of indexes being built. For indexes on
static columns this is wasteful and could be improved so that only the
types of rows relevant to indexes being built are fetched - however,
doing this sounds a bit complicated and I would rather start with
something simpler which has a better chance of working.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
25fec0acce gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
The new feature will prevent secondary indexes on static columns from
being created unless the whole cluster is ready to support them.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
9f14f0ac09 create_index_statement: disallow creation of local indexes with static columns
Local indexes on static columns don't make sense because there is only
one static row per partition. It's always better to just run SELECT
DISTINCT on the base table. Allowing for such an index would only make
such queries slower (due to double lookup), would take unnecessary space
and could pose potential consistency problems, so this commit explicitly
forbids them.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
8c4cdfc2db select_statement: prepare paging for indexes on static columns
When performing a query on a table which is accelerated by a secondary
index, the paging state returned along with the query contains a
partition key and a clustering key of the secondary index table. The
logic wasn't prepared to handle the case of secondary indexes on static
columns - notably, it tried to put base table's clustering key columns
into the paging state which caused problems in other places.

This commit fixes the paging logic so that the PK and CK of a secondary
index table is calculated correctly. However, this solution has a major
drawback: because it is impossible to encode clustering key of the base
table in the paging state, partitions returned by queries accelerated by
secondary indexes on static columns will _not_ be split by paging. This
can be problematic in case there are large partitions in the base table.

The main advantage of this fix is that it is simple. Moreover, the
problem described above is not unique to static column indexes, but also
happens e.g. in case of some indexes on clustering columns (see case 2
of scylladb/scylla#7432). Fixing this issue will require a more
sophisticated solution and may affect more than only secondary indexes
on static columns, so this is left for a followup.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
ba390072c5 select_statement: do not attempt to fetch clustering columns from secondary index's table
The previous commit made sure that the index table for secondary indexes
on static tables don't have columns corresponding to clustering rows in
the base table - therefore, we must make sure that we don't try to fetch
them when querying the index table.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
983b440a81 secondary_index_manager: don't add clustering key columns to index table of static column index
The implementation of secondary indexes on static columns relies on the
fact that the index table only includes partition key columns of the
base table, but not clustering key columns. A static column's value
determines a set of full partitions, so including the clustering key
would only be redundant. It would also generate more work as a single
static column update would require a large portion of the index to be
updated.

This commit makes sure that clustering columns are not included in the
index table for indexes based on a static column.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
6ab41d76e6 replica/table: adjust the view read-before-write to return static rows when needed
Adjusts the read-before-write query issued in
`table::do_push_view_replica_updates` so that, when needed, requests
static columns and makes sure that the static row is present.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
18be90b1e6 db/view: process static rows in view_update_builder::on_results
The `view_update_builder::on_results()` function is changed to react to
static rows when comparing read-before-write results with the base table
mutation.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
2dd95d76f1 db/view: adjust existing view update generation path to use clustering_or_static_row
The view update path is modified to use `clustering_or_static_row`
instead of just `clustering_row`.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
b0a31bb7a7 column_computation: adjust to use clustering_or_static_row
Adjusts the column_computation interface so that it is able to accept
both clustering and static rows through the common
db::view::clustering_or_static_row interface.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
986ab6034c db/view: add clustering_or_static_row
Adds a `clustering_or_static_row`, which is a common, immutable
representation of either a static or clustering row. It will allow to
handle view update generation based on static or clustering rows in a
uniform way.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
05d4328f02 deletable_row: add column_kind parameter to is_live
While deletable_row is used to hold regular columns of a clustering row,
its name or implementation doesn't suggest that it is a requirement. In
fact, some of its methods already take a column_kind parameter which is
used to interpret the kind of columns held in the row.

This commit removes the assumption about the column kind from the
`deletable_row::is_live` method.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
27c81432cd view_info: adjust view_column to accept column_kind
The `view_info::view_column()` and `view_column` in view.cc allow to get
a view's column definition which corresponds to given base table's
column. They currently assume that the given column id corresponds to a
regular column. In preparation for secondary indexes based on static
columns, this commit adjusts those functions so that they accept other
kinds of columns, including static columns.
2022-12-06 11:21:16 +01:00
Piotr Dulikowski
f7b7724eaf db/view: base_dependent_view_info: split non-pk columns into regular and static
Currently, `base_dependent_view_info::_base_non_pk_columns_in_view_pk`
field keeps a list of non-primary-key columns from the base table which
are a part of the view's primary key. Because the current code does not
allow indexes on static columns yet, the columns kept in the
aforementioned field are always assumed to be regular columns of the
base table and are kept as `column_id`s which do not contain information
about the column kind.

This commit splits the `_base_non_pk_columns_in_view_pk` field into two,
one for regular columns and the other for static columns, so that it is
possible to keep both kinds of columns in `base_dependent_view_info` and
the structure can be used for secondary indexes on static columns.
2022-12-06 11:21:16 +01:00
Botond Dénes
681bd62424 Update tools/java submodule
* tools/java ecab7cf7d6...1c4e1e7a7d (2):
  > Merge "Cqlsh serverless v2" from Karol Baryla
  > Update Java Driver version to 3.11.2.4
2022-12-06 09:06:09 +02:00
Botond Dénes
6a1dbffaaa Merge 'compaction_manager: coroutinize postponed_compactions_reevaluation' from Avi Kivity
Three lambdas were removed, simplifying the code.

Closes #12207

* github.com:scylladb/scylladb:
  compaction_manager: reindent postponed_compactions_reevaluation()
  compaction_manager: coroutinize postponed_compactions_reevaluation()
  compaction_manager: make postponed_compactions_reevaluation() return a future
2022-12-06 08:08:36 +02:00
Avi Kivity
2339a3fa06 database: remove continuation for updating statistics
update_write_metrics() is a continuation added solely for updating
statistics. Fold it into do_update to reduce an allocation in the
write path.

```console
$ ./artifacts/before --write --smp 1  2<&1 | grep insn
189930.77 tps ( 57.2 allocs/op,  13.2 tasks/op,   50994 insns/op,        0 errors)
189954.18 tps ( 57.2 allocs/op,  13.2 tasks/op,   51086 insns/op,        0 errors)
188623.86 tps ( 57.2 allocs/op,  13.2 tasks/op,   51083 insns/op,        0 errors)
190115.01 tps ( 57.2 allocs/op,  13.2 tasks/op,   51092 insns/op,        0 errors)
190173.71 tps ( 57.2 allocs/op,  13.2 tasks/op,   51083 insns/op,        0 errors)
median 189954.18 tps ( 57.2 allocs/op,  13.2 tasks/op,   51086 insns/op,        0 errors)
```

vs

```console
$ ./artifacts/after --write --smp 1  2<&1 | grep insn
190358.38 tps ( 56.2 allocs/op,  12.2 tasks/op,   50754 insns/op,        0 errors)
185222.78 tps ( 56.2 allocs/op,  12.2 tasks/op,   50789 insns/op,        0 errors)
184508.09 tps ( 56.2 allocs/op,  12.2 tasks/op,   50842 insns/op,        0 errors)
142099.47 tps ( 56.2 allocs/op,  12.2 tasks/op,   50825 insns/op,        0 errors)
190447.22 tps ( 56.2 allocs/op,  12.2 tasks/op,   50811 insns/op,        0 errors)
```

One allocation and ~300 cycles saved.

update_write_metrics() is still called from other call sites, so it is
not removed.

Closes #12108
2022-12-06 07:04:17 +02:00
Botond Dénes
6daa1e973f Merge 'alternator: fix hangs related to TTL scanning' from Nadav Har'El
The first patch in this small series fixes a hang during shutdown when the expired-item scanning thread can hang in a retry loop instead of quitting.  These hangs were seen in some test runs (issue #12145).

The second patch is a failsafe against additional bugs like those solved by the first patch: If any bugs causes the same page fetch to repeatedly time out, let's stop the attempts after 10 retries instead of retrying for ever. When we stop the retries, a warning will be printed to the log, Scylla will wait until the next scan period and start a new scan from scratch - from a random position in the database, instead of hanging potentially-forever waiting for the same page.

Closes #12152

* github.com:scylladb/scylladb:
  alternator ttl: in scanning thread, don't retry the same page too many times
  alternator: fix hang during shutdown of expiration-scanning thread
2022-12-06 06:44:22 +02:00
Botond Dénes
c5da96e6f7 Merge 'cql3: batch_statement: coroutinize get_mutations()' from Avi Kivity
As it has a do_with(), coroutinizing it is an automatic win.

Closes #12195

* github.com:scylladb/scylladb:
  cql3: batch_statement: reindent get_mutations()
  cql3: batch_statement: coroutinize get_mutations()
2022-12-06 06:41:44 +02:00
Avi Kivity
d2b1d2f695 compaction_manager: reindent postponed_compactions_reevaluation() 2022-12-05 22:02:27 +02:00
Avi Kivity
1669025736 compaction_manager: coroutinize postponed_compactions_reevaluation()
So much nicer.
2022-12-05 22:01:41 +02:00
Avi Kivity
d2c44cba77 compaction_manager: make postponed_compactions_reevaluation() return a future
postponed_compactions_reevaluation() runs until compaction_manager is
stopped, checking if it needs to launch new compactions.

Make it return a future instead of stashing its completion somewhere.
This makes is easier to convert it to a coroutine.
2022-12-05 21:58:48 +02:00
Avi Kivity
fe4d7fbdf2 Update abseil submodule
* abseil 7f3c0d78...4e5ff155 (125):
  > Add a compilation test for recursive hash map types
  > Add AbslStringify support for enum types in Substitute.
  > Use a c++14-style constexpr initialization if c++14 constexpr is available.
  > Move the vtable into a function to delay instantiation until the function is called. When the variable is a global the compiler is allowed to instantiate it more aggresively and it might happen before the types involved are complete. When it is inside a function the compiler can't instantiate it until after the functions are called.
  > Cosmetic reformatting in a test.
  > Reorder base64 unescape methods to be below the escaping methods.
  > Fixes many compilation issues that come from having no external CI coverage of the accelerated CRC implementation and some differences bewteen the internal and external implementation.
  > Remove static initializer from mutex.h.
  > Import of CCTZ from GitHub.
  > Remove unused iostream include from crc32c.h
  > Fix MSVC builds that reject C-style arrays of size 0
  > Remove deprecated use of absl::ToCrc32c()
  > CRC: Make crc32c_t as a class for explicit control of operators
  > Convert the full parser into constexpr now that Abseil requires C++14, and use this parser for the static checker. This fixes some outstanding bugs where the static checker differed from the dynamic one. Also, fix `%v` to be accepted with POSIX syntax.
  > Write (more) directly into the structured buffer from StringifySink, including for (size_t, char) overload.
  > Avoid using the non-portable type __m128i_u.
  > Reduce flat_hash_{set,map} generated code size.
  > Use ABSL_HAVE_BUILTIN to fix -Wundef __has_builtin warning
  > Add a TODO for the deprecation of absl::aligned_storage_t
  > TSAN: Remove report_atomic_races=0 from CI now that it has been fixed
  > absl: fix Mutex TSan annotations
  > CMake: Remove trailing commas in `AbseilDll.cmake`
  > Fix AMD cpu detection.
  > CRC: Get CPU detection and hardware acceleration working on MSVC x86(_64)
  > Removing trailing period that can confuse a url in str_format.h.
  > Refactor btree iterator generation code into a base class rather than using ifdefs inside btree_iterator.
  > container.h: fix incorrect comments about the location of <numeric> algorithms.
  > Zero encoded_remaining when a string field doesn't fit, so that we don't leave partial data in the buffer (all decoders should ignore it anyway) and to be sure that we don't try to put any subsequent operands in either (there shouldn't be enough space).
  > Improve error messages when comparing btree iterators when generations are enabled.
  > Document the WebSafe* and *WithPadding variants more concisely, as deltas from Base64Encode.
  > Drop outdated comment about LogEntry copyability.
  > Release structured logging.
  > Minor formatting changes in preparation for structured logging...
  > Update absl::make_unique to reflect the C++14 minimum
  > Update Condition to allocate 24 bytes for MSVC platform pointers to methods.
  > Add missing include
  > Refactor "RAW: " prefix formatting into FormatLogPrefix.
  > Minor formatting changes due to internal refactoring
  > Fix typos
  > Add a new API for `extract_and_get_next()` in b-tree that returns both the extracted node and an iterator to the next element in the container.
  > Use AnyInvocable in internal thread_pool
  > Remove absl/time/internal/zoneinfo.inc.  It was used to guarantee availability of a few timezones for "time_test" and "time_benchmark", but (file-based) zoneinfo is now secured via existing Bazel data/env attributes, or new CMake environment settings.
  > Updated documentation on use of %v Also updated documentation around FormatSink and PutPaddedString
  > Use the correct Bazel copts in crc targets
  > Run the //absl/time timezone tests with a data dependency on, and a matching ${TZDIR} setting for, //absl/time/internal/cctz:zoneinfo.
  > Stop unnecessary clearing of fields in ~raw_hash_set.
  > Fix throw_delegate_test when using libc++ with shared libraries
  > CRC: Ensure SupportsArmCRC32PMULL() is defined
  > Improve error messages when comparing btree iterators.
  > Refactor the throw_delegate test into separate test cases
  > Replace std::atomic_flag with std::atomic<bool> to avoid the C++20 deprecation of ATOMIC_FLAG_INIT.
  > Add support for enum types with AbslStringify
  > Release the CRC library
  > Improve error messages when comparing swisstable iterators.
  > Auto increase inlined capacity whenever it does not affect class' size.
  > drop an unused dep
  > Factor out the internal helper AppendTruncated, which is used and redefined in a couple places, plus several more that have yet to be released.
  > Fix some invalid iterator bugs in btree_test.cc for multi{set,map} emplace{_hint} tests.
  > Force a conservative allocation for pointers to methods in Condition objects.
  > Fix a few lint findings in flags' usage.cc
  > Narrow some _MSC_VER checks to not catch clang-cl.
  > Small cleanups in logging test helpers
  > Import of CCTZ from GitHub.
  > Merge pull request abseil/abseil-cpp#1287 from GOGOYAO:patch-1
  > Merge pull request abseil/abseil-cpp#1307 from KindDragon:patch-1
  > Stop disabling some test warnings that have been fixed
  > Support logging of user-defined types that implement `AbslStringify()`
  > Eliminate span_internal::Min in favor of std::min, since Min conflicts with a macro in a third-party library.
  > Fix -Wimplicit-int-conversion.
  > Improve error messages when dereferencing invalid swisstable iterators.
  > Cord: Avoid leaking a node if SetExpectedChecksum() is called on an empty cord twice in a row.
  > Add a warning about extract invalidating iterators (not just the iterator of the element being extracted).
  > CMake: installed artifacts reflect the compiled ABI
  > Import of CCTZ from GitHub.
  > Import of CCTZ from GitHub.
  > Support empty Cords with an expected checksum
  > Move internal details from one source file to another more appropriate source file.
  > Removes `PutPaddedString()` function
  > Return uint8_t from CappedDamerauLevenshteinDistance.
  > Remove the unknown CMAKE_SYSTEM_PROCESSOR warning when configuring ABSL_RANDOM_RANDEN_COPTS
  > Enforce Visual Studio 2017 (MSVC++ 15.0) minumum
  > `absl::InlinedVector::swap` supports non-assignable types.
  > Improve b-tree error messages when dereferencing invalid iterators.
  > Mutex: Fix stall on single-core systems
  > Document Base64Unescape() padding
  > Fix sign conversion warnings in memory_test.cc.
  > Fix a sign conversion warning.
  > Fix a truncation warning on Windows 64-bit.
  > Use btree iterator subtraction instead of std::distance in erase_range() and count().
  > Eliminate use of internal interfaces and make the test portable and expose it to OSS.
  > Fix various warnings for _WIN32.
  > Disables StderrKnobsDefault due to order dependency
  > Implement btree_iterator::operator-, which is faster than std::distance for btree iterators.
  > Merge pull request abseil/abseil-cpp#1298 from rpjohnst:mingw-cmake-build
  > Implement function to calculate Damerau-Levenshtein distance between two strings.
  > Change per_thread_sem_test from size medium to size large.
  > Support stringification of user-defined types in AbslStringify in absl::Substitute.
  > Fix "unsafe narrowing" warnings in absl, 12/12.
  > Revert change to internal 'Rep', this causes issues for gdb
  > Reorganize InlineData into an inner Rep structure.
  > Remove internal `VLOG_xxx` macros
  > Import of CCTZ from GitHub.
  > `absl::InlinedVector` supports move assignment with non-assignable types.
  > Change Cord internal layout, which reduces store-load penalties on ARM
  > Detects accidental multiple invocations of AnyInvocable<R(...)&&>::operator()&& by producing an error in debug mode, and clarifies that the behavior is undefined in the general case.
  > Fix a bug in StrFormat. This issue would have been caught by any compile-time checking but can happen for incorrect formats parsed via ParsedFormat::New. Specifically, if a user were to add length modifiers with 'v', for example the incorrect format string "%hv", the ParsedFormat would incorrectly be allowed.
  > Adds documentation for stringification extension
  > CMake: Remove check_target calls which can be problematic in case of dependency cycle
  > Changes mutex unlock profiling
  > Add static_cast<void*> to the sources for trivial relocations to avoid spurious -Wdynamic-class-memaccess errors in the presence of other compilation errors.
  > Configure ABSL_CACHE_ALIGNED for clang-like and MSVC toolchains.
  > Fix "unsafe narrowing" warnings in absl, 11/n.
  > Eliminate use of internal interfaces
  > Merge pull request abseil/abseil-cpp#1289 from keith:ks/fix-more-clang-deprecated-builtins
  > Merge pull request abseil/abseil-cpp#1285 from jun-sheaf:patch-1
  > Delete LogEntry's copy ctor and assignment operator.
  > Make sinks provided to `AbslStringify()` usable with `absl::Format()`.
  > Cast unused variable to void
  > No changes in OSS.
  > No changes in OSS
  > Replace the kPower10ExponentTable array with a formula.
  > CMake: Mark absl::cord_test_helpers and absl::spy_hash_state PUBLIC
  > Use trivial relocation for transfers in swisstable and b-tree.
  > Merge pull request abseil/abseil-cpp#1284 from t0ny-peng:chore/remove-unused-class-in-variant
  > Removes the legacy spellings of the thread annotation macros/functions by default.

Closes #12201
2022-12-05 21:07:16 +02:00
Eliran Sinvani
5a5514d052 cql server: Only parallelize relevant cql requests
The cql server uses an execution stage to process and execute queries,
however, processing stage is best utilized when having a recurrent flow
that needs to be called repeatedly since it better utilizes the
instruction cache.
Up until now, every request was sent through the processing stage, but
most requests are not meant to be executed repeatedly with high volume.
This change processes and executes the data queries asynchronously,
through an execution stage, and all of the rest are processed one by
one, only continuing once the request has been done end to end.

Tests:
Unit tests in dev and debug.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #12202
2022-12-05 21:06:58 +02:00
Takuya ASADA
b7851ab1ec docker: fix locale on SSH shell
4ecc08c broke locale settings on SSH shell, since we dropped "update-locale".
To fix this without installing locales package, we need to manually specify
LANG=C.UTF-8 in /etc/default/locale.

see https://github.com/scylladb/scylla-cluster-tests/pull/5519

Closes #12197
2022-12-05 20:02:18 +02:00
Avi Kivity
6f2d060d12 Merge 'Make sstable_directory call sstable_manager for sstables' components' from Pavel Emelyanov
This PR hits two goals for "object storage" effort

1. Sstables loader "knows" that sstables components are stored in a Linux directory and uses utils/lister to access it. This is not going to work with sstables over object storage, the loader should be abstracted from the underlying storage.

2. Currently class keyspace and class column_family carry "datadir" and "all_datadirs" on board which are path on local filesystem where sstable files are stored (those usually started with /var/lib/scylla/data). The paths include subsdirs like "snapshots", "staging", etc. This is not going to look nice for obejct storage, the /var/lib/ prefix is excessive and meaningless in this case. Instead, ks and cf should know their "location" and some other component should know the directory where in which the files are stored.

Said that, this PR prepares distributed_loader and sstables_directly to stop using Linux paths explicitly by making both call sstables_manager to list and open sstables object. After it will be possible to teach manager to list sstables from object storage. Also this opens the way to removing paths from keyspace and column_family classes and replacing those with relative "location"s.

Closes #12128

* github.com:scylladb/scylladb:
  sstable_directory: Get components lister from manager
  sstable_directory: Extract directory lister
  sstable_directory: Remove sstable creation callback
  sstable_directory: Call manager to make sstables
  sstable_directory: Keep error handler generator
  sstable_directory: Keep schema_ptr
  sstable_directory: Use directory semaphore from manager
  sstable_directory: Keep reference on manager
  tests: Use sstables creation helper in some cases
  sstables_manager: Keep directory semaphore reference
  sstables, code: Wrap directory semaphore with concurrency
2022-12-05 18:54:17 +02:00
Gleb Natapov
022a825b33 raft: introduce not_a_member error and return it when non member tries to do add/modify_config
Currently if a node that is outside of the config tries to add an entry
or modify config transient error is returned and this causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above it means it was part of the cluster at some point,
but since a node with the same id should not be added back to a cluster
if it is not in the cluster now it will never be.

Return a new error not_a_member to a caller instead.

Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
2022-12-05 17:11:04 +01:00
Benny Halevy
c61083852c storage_service: handle_state_normal: calculate candidates_for_removal when replacing tokens
We currently try to detect a replaced node so to insert it to
endpoints_to_remove when it has no owned tokens left.
However, for each token we first generate a multimap using
get_endpoint_to_token_map_for_reading().

There are 2 problems with that:

1. unless the replaced node owns a single token, this map will not
   be empty after erasing one token out of it, since the
   token metadata has not changed yet (this is done later with
   update_normal_tokens(owned_tokens, endpoint)).
2. generating this map for each token is inefficient, turning this
   algorithm complexity to quadratic in the number of tokens...

This change copies the current token_to_endpoint map
to temporary map and erases replaced tokens from it,
while maintaining a set of candidates_for_removal.

After traversing all replaced tokens, we check again
the `token_to_endpoint_map` erasing from `candidates_for_removal`
any endpoint that still owns tokens.
The leftover candidates are endpoints the own no tokens
and so they are added to `hosts_to_remove`.

Fixes #12082

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12141
2022-12-05 16:17:18 +01:00
Botond Dénes
3d620378d4 Merge 'view: coroutinize maybe_mark_view_as_built' from Avi Kivity
Simplifying it a little.

Closes #12171

* github.com:scylladb/scylladb:
  view: reindent maybe_mark_view_as_built
  view: coroutinize maybe_mark_view_as_built
2022-12-05 13:43:34 +02:00
Kamil Braun
3f8aaeeab9 test/topology: enable replace tests
Also add some TODOs for enhancing existing tests.
2022-12-05 11:50:07 +01:00
Kamil Braun
ee19411783 service/raft: report an error when Raft ID can't be found in raft_group0::remove_from_group0
Also simplify the code and improve logging in general.

The previous code did this: search for the ID in the address map. If it
couldn't be found, perform a read barrier and search again. If it again
couldn't be found, return.

This algorithm depended on the fact that IP addresses were stored in
group 0 configuration. The read barrier was used to obtain the most
recent configuration, and if the IP was not a part of address map after
the read barrier, that meant it's simply not a member of group 0.

This logic no longer applies so we can simplify the code.

Furthermore, when I was fixing the replace operation with Raft enabled,
at some point I had a "working" solution with all tests passing. But I
was suspicious and checked if the replaced node got removed from
group 0. It wasn't. So the replace finished "successfully", but we had
an additional (voting!) member of group 0 which didn't correspond to
a token ring member.

The last version of my fixes ensure that the node gets removed by the
replacing node. But the system is fragile and nothing prevents us from
breaking this again. At least log an error for now. Regression tests
will be added later.
2022-12-05 11:50:07 +01:00
Kamil Braun
4429885543 service: handle replace correctly with Raft enabled
We must place the Raft ID obtained during the shadow round in the
address map. It won't be placed by the regular gossiping route if we're
replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it
is not guaranteed that background gossiping manages to update the
address map before we need it, especially in tests where we set
ring_delay to 0 and disable wait_for_gossip_to_settle. The shadow round,
on the other hand, performs a synchronous request (and if it fails
during bootstrap, bootstrap will fail - because we also won't be able to
obtain the tokens and Host ID of the replaced node).

Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
2022-12-05 11:50:07 +01:00
Kamil Braun
45bb5bfb52 gms/gossiper: fetch RAFT_SERVER_ID during shadow round
During the replace operation we need the Raft ID of the replaced node.
The shadow round is used for fetching all necessary information before
the replace operation starts.
2022-12-05 11:50:07 +01:00
Kamil Braun
7222c2f9a1 service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
Most of the sleeps related to gossiping are based on `ring_delay`,
which is configurable and can be set to lower value e.g. during tests.

But for some reason there was one case where we slept for a hardcoded
value, `service::load_broadcaster::BROADCAST_INTERVAL` - 60 seconds.

Use `2 * get_ring_delay()` instead. With the default value of
`ring_delay` (30 seconds) this will give the same behavior.
2022-12-05 11:50:07 +01:00
Pavel Emelyanov
b5ede873f2 sstable_directory: Get components lister from manager
For now this is almost a no-op because manager just calls
sstables_directory code back to create the lister.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
3f9b8c855d sstable_directory: Extract directory lister
Currently the utils/lister.cc code is in use to list regular files in a
directory. This patch wraps the lister into more abstract components
lister class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
abd3602b10 sstable_directory: Remove sstable creation callback
It's no longer used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
3d559391df sstable_directory: Call manager to make sstables
Now the directory code has everyhting it needs to create sstable object
and can stop using the external lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
db657a8d1c sstable_directory: Keep error handler generator
Yet another continuation to previous patch -- IO error handlers
generator is also needed to create sstables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
4281f4af42 sstable_directory: Keep schema_ptr
Continuation of one-before-previous patch. In order to create sstable
without external lambda the directory code needs schema.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
8df1bcb907 sstable_directory: Use directory semaphore from manager
After previous patch sstables_directory code may no longer require for
semaphore argument, because it can get one from manager. This makes the
directory API shorter and simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
4da941e159 sstable_directory: Keep reference on manager
The sstables_directly accesses /var/lib/scylla/data in two ways -- lists
files in it and opens sstables. The latter is abdtracted with the help
of lambdas passed around, but the former (listing) is done by using
directory liters from utils.

Listing sstables components with directlry lister won't work for object
storage, the directory code will need to call some abstraction layer
instead. Opening sstables with the help of a lambda is a bit of
overkill, having sstables manager at hand could make it much simpler.

Said that, this patch makes sstables_directly reference sstables_manager
on start.

This change will also simplify directory semaphore usage (next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
784d78810a tests: Use sstables creation helper in some cases
Several test cases push sstables creation lambda into
with_sstables_directory helper. There's a ready to use helper class that
does the same. Next patch will make additional use of that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:19 +03:00
Pavel Emelyanov
5e13ce2619 sstables_manager: Keep directory semaphore reference
Preparational patch. The semaphore will be used by sstables_directory in
next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 12:03:18 +03:00
Pavel Emelyanov
be8512d7cc sstables, code: Wrap directory semaphore with concurrency
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelizm.

This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 11:59:30 +03:00
Asias He
c6087cf3a0 repair: Reduce repair reader eviction with diff shard count
When repair master and followers have different shard count, the repair
followers need to create multi-shard readers. Each multi-shard reader
will create one local reader on each shard, N (smp::count) local readers
in total.

There is a hard limit on the number of readers who can work in parallel.
When there are more readers than this limit. The readers will start to
evict each other, causing buffers already read from disk to be dropped
and recreating of readers, which is not very efficient.

To optimize and reduce reader eviction overhead, a global reader permit
is introduced which considers the multi-shard reader bloats.

With this patch, at any point in time, the number of readers created by
repair will not exceed the reader limit.

Test Results:

1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2
shards, n2=8 shards, memory wanted =1

1.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2  (repair on n2)
[2022-11-23 17:45:24,770] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:45:53,869] Repair session 1
[2022-11-23 17:45:53,869] Repair session 1 finished

real    0m30.212s
user    0m1.680s
sys     0m0.222s

1.2)
[asias@hjpc2 mycluster]$ time nodetool  repair ks2  (repair on n1)
[2022-11-23 17:46:07,507] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:46:30,608] Repair session 1
[2022-11-23 17:46:30,608] Repair session 1 finished

real    0m24.241s
user    0m1.731s
sys     0m0.213s

2) with stream sem 10, repair global sem no_limit, 5 ranges in
parallel, n1=2 shards, n2=8 shards, memory wanted =1

2.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:49:49,301] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:01,414] Repair session 1
[2022-11-23 17:52:01,415] Repair session 1 finished

real    2m13.227s
user    0m1.752s
sys     0m0.218s

2.2)
[asias@hjpc2 mycluster]$ time nodetool  repair ks2 (repair on n1)
[2022-11-23 17:52:19,280] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:42,387] Repair session 1
[2022-11-23 17:52:42,387] Repair session 1 finished

real    0m24.196s
user    0m1.689s
sys     0m0.184s

Comparing 1.1) and 2.1), it shows the eviction played a major role here.
The patch gives 73s / 30s = 2.5X speed up in this setup.

Comparing 1.1 and 1.2, it shows even if we limit the readers, starting
on the lower shard is faster 30s / 24s = 1.25X (the total number of
multishard readers is lower)

Fixes #12157

Closes #12158
2022-12-05 10:47:36 +02:00
Botond Dénes
1e20095547 Update tools/java submodule
* tools/java 1c06006447...ecab7cf7d6 (1):
  > Add VSCode files to gitignore
2022-12-05 09:54:51 +02:00
Botond Dénes
c4d72c8dd0 Merge 'cql3: select_statement: split and coroutinize process_results()' from Avi Kivity
Split the simple (and common) case from the complex case,
and coroutinize the latter. Hopefully this generates better
code for the simple case, and it makes the complex case a
little nicer.

Closes #12194

* github.com:scylladb/scylladb:
  cql3: select_statement: reindent process_results_complex()
  cql3: select_statement: coroutinize process_results_complex()
  cql3: select_statement: split process_results() into fast path and complex path
2022-12-05 08:16:22 +02:00
Avi Kivity
a0a4711b74 snapshot: protect list operations against the lambda coroutine fiasco
run_snapshot_list_operation() takes a continuation, so passing it
a lambda coroutine without protection is dangerous.

Protect the coroutine with coroutine::lambda so it doesn't lost its
contents.

Fixes #12192.

Closes #12193
2022-12-05 08:14:39 +02:00
guy9
cb842b2729 Replacing the Docs top bar message from the LIVE event to the community forum announcement
Closes #12189
2022-12-05 08:05:04 +02:00
Avi Kivity
6326be5796 cql3: batch_statement: reindent get_mutations() 2022-12-04 21:47:22 +02:00
Avi Kivity
2d74360de3 cql3: batch_statement: coroutinize get_mutations()
It has a do_with(), so an automatic win.
2022-12-04 21:45:10 +02:00
Avi Kivity
0834bb0365 cql3: select_statement: reindent process_results_complex() 2022-12-04 21:36:17 +02:00
Avi Kivity
a63f98e3fc cql3: select_statement: coroutinize process_results_complex()
Not a huge gain, since it's just a do_with, but still a little better.

Note the inner lambda is not a coroutine, so isn't susceptibe to
the lambda coroutine fiasco.
2022-12-04 21:34:51 +02:00
Avi Kivity
7f29efa0ad cql3: select_statement: split process_results() into fast path and complex path
This will allow us to coroutinize the complex path without adding an
allocation to the fast path.
2022-12-04 21:30:45 +02:00
Avi Kivity
02b66bb31a Merge 'Mark sstable::<directory accessing methods> private' from Pavel Emelyanov
One of the prerequisites to make sstables reside on object-storage is not to let the rest of the code "know" the filesystem path they are located on (because sometimes they will not be on any filesystem path). This patch makes the methods that can reveal this path back private so that later they can be abstracted out.

Closes #12182

* github.com:scylladb/scylladb:
  sstable: Mark some methods private
  test: Don't get sstable dir when known
  test: Use move_to_quarantine() helper
  test: Use sstable::filename() overload without dir name
  sstables: Reimplement batch directory sync after move
  table, tests: Make use of move_to_new_dir() default arg
  sstables: Remove fsync_directory() helper
  table: Simplify take_snapshot()'s collecting sstables names
2022-12-04 17:45:37 +02:00
Kamil Braun
b551cd254c test: test_raft_upgrade: fix test_recover_stuck_raft_upgrade flakiness
The test enables an error injection inside the Raft upgrade procedure
on one of the nodes which will cause the node to throw an exception
before entering `synchronize` state. Then it restarts other nodes with
Raft enabled, waits until they enter `synchronize` state, puts them in
RECOVERY mode, removes the error-injected node and creates a new Raft
group 0.

As soon as the other nodes enter `synchronize`, the test disabled the
error injection (the rest of the test was outside the `async with
inject_error(...)` block). There was a small chance that we disabled the
error injection before the node reached it. In that case the node also
entered `synchronize` and the cluster managed to finish the upgrade
procedure. We encountered this during next promotion.

Eliminate this possibility by extending the scope of the `async with
inject_error(...)` block, so that the RECOVERY mode steps on the other
nodes are performed within that block.

Closes #12162
2022-12-02 21:26:44 +01:00
Avi Kivity
94f18b5580 test: sstable_conforms_to_mutation_source: use do_with_async() where needed
The test clearly needs a thread (it converts a reader to a mutation
without waiting), so give it one.

Closes #12178
2022-12-02 20:48:37 +01:00
Pavel Emelyanov
084522d9eb sstable: Mark some methods private
There are several class sstable methods that reveal internal directory
path to caller. It's not object-storage-friendly. Fortunately, all the
callers of those methods had been patched not to work with full paths,
so these can be marked private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:15:02 +03:00
Pavel Emelyanov
fb63850f2c test: Don't get sstable dir when known
The sstable_move_test creates sstables in its own temp directories and
the requests these dirs' paths back from sstables. Test can come with
the paths it has at hand, no need to call sstables for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:13:58 +03:00
Pavel Emelyanov
4c742a658d test: Use move_to_quarantine() helper
Two places in tests move sstable to quarantine subdir by hand. There's
the class sstable method that does the same, so use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:13:19 +03:00
Pavel Emelyanov
d6244b7408 test: Use sstable::filename() overload without dir name
The dir this place currently uses is the directory where the sstable was
created, so dropping this argument would just render the same path.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:12:21 +03:00
Pavel Emelyanov
a702affd4d sstables: Reimplement batch directory sync after move
There's a table::move_sstables_from_staging() method that gets a bunch
of sstables and moves them from staging subdit into table's root
datadir. Not to flush the root dir for every sstable move, it asks the
sstable::move_to_new_dir() not to flush, but collects staging dir names
and flushes them and the root dir at the end altothether.

In order to make it more friendly to object-storage and to remove one
more caller of sstable::get_dir() the delayed_commit_changes struct is
introduced. It collects _all_ the affected dir names in unordered_set,
then allows flushing them. By default the move_to_new_dir() doesn't
receive this object and flushes the directories instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:08:47 +03:00
Pavel Emelyanov
1b42d5fce3 table, tests: Make use of move_to_new_dir() default arg
The method in question accepts boolean bit whether or not it should sync
directories at the end. It's always true but in one case, so there's the
default value for it. Make use of it.

Anticipating the suggestion to replace bool with bool_class -- next
patch will replace it with something else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:07:16 +03:00
Pavel Emelyanov
339feb4205 sstables: Remove fsync_directory() helper
The one effectively wraps existing seastar sync_directory() helper into
two io_check-s. It's simpler just to call the latter directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:05:43 +03:00
Pavel Emelyanov
80f5d7393f table: Simplify take_snapshot()'s collecting sstables names
The method in question "snapshots" all sstables it can find, then writes
their Datafile names into the manifest file. To get the list of file
names it iterates over sstables list again and does silly conversion of
full file path to file name with the help of the directory path length.

This all can be made much simpler if just collecting component names
directly at the time sstable is hardlinked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-02 21:02:37 +03:00
Raphael S. Carvalho
d61b4f9dfb compaction_manager: Delete compaction_state's move constructor
compaction_state shouldn't be moved once emplaced. moving it could
theoretically cause task's gate holder to have a dangling pointer to
compaction_state's gate, but turns out gate's move ctor will actually
fail under this assertion:
assert(!_count && "gate reassigned with outstanding requests");

Cannot happen today, but let's make it more future proof.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12167
2022-12-02 20:56:57 +03:00
Tomasz Grabiec
1a6bf2e9ca Merge 'service/raft: specialized verb for failure detector pinger' from Kamil Braun
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.

Closes #12107

* github.com:scylladb/scylladb:
  service/raft: specialized verb for failure detector pinger
  db: system_keyspace: de-staticize `{get,set}_raft_server_id`
  service/raft: make this node's Raft ID available early in group registry
2022-12-02 13:54:02 +01:00
Pavel Emelyanov
71179ff5ab distributed_loader: Use coroutine::lambda in sleeping coroutine
According to seastar/doc/lambda-coroutine-fiasco.md lambda that
co_awaits once loses its capture frame. In distrobuted_loader
code there's at least one of that kind.

fixes: #12175

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12170
2022-12-02 13:06:33 +02:00
Pavel Emelyanov
1d91914166 sstables: Drop set_generation() method
The method became unused since 70e5252a (table: no longer accept online
loading of SSTable files in the main directory) and the whole concept of
reshuffling sstables was dropped later by 7351db7c (Reshape upload files
and reshard+reshape at boot).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12165
2022-12-01 22:17:10 +02:00
Avi Kivity
2978052113 view: reindent maybe_mark_view_as_built
Several identation levels were harmed during the preparation
of this patch.
2022-12-01 22:09:21 +02:00
Avi Kivity
ac2e2f8883 view: coroutinize maybe_mark_view_as_built
Somewhat simplifies complicated logic.
2022-12-01 22:04:51 +02:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Kamil Braun
02c64becdc db: system_keyspace: de-staticize {get,set}_raft_server_id
Part of the anti-globals war.
2022-12-01 20:54:18 +01:00
Kamil Braun
99fe580068 service/raft: make this node's Raft ID available early in group registry
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.

Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.

We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.

This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
2022-12-01 20:54:18 +01:00
Nadav Har'El
6fcb5302a6 alternator-test: xfail a flaky test exposing a known bug
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.

Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened #12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.

Note that although the test will now be marked "xfail", it will actually
pass most of the time, so will appear as "xpass" to people run it.
I put a note in the xfail reason string as a reminder why this is
happening.

Fixes #12143
Refs #12166
Refs scylladb/seastar#1325

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12169
2022-12-01 20:00:46 +02:00
Kamil Braun
3cd035d1b9 test/pylib: scylla_cluster: remove ScyllaCluster.decommissioned field
The field was not used for anything. We can keep decommissioned server
in `stopped` field.

In fact it caused us a problem: since recently, we're using
`ScyllaCluster.uninstall` to clean-up servers after test suite finishes
(previously we were using `ScyllaServer.uninstall` directly). But
`ScyllaCluster.uninstall` didn't look into the `decommissioned` field,
so if a server got decommissioned, we wouldn't uninstall it, and it left
us some unnecessary artifacts even for successful tests. This is now
fixed.

Closes #12163
2022-12-01 19:07:26 +02:00
Avi Kivity
a4b77a5691 Merge 'Cleanup sstables::test_env's manager usage' from Pavel Emelyanov
Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain.

Closes #12155

* github.com:scylladb/scylladb:
  test, utils: Use only one tempdir
  sstable_compaction_test: Dont create nested envs
  mutation_reader_test: Remove unused create_sstable() helper
  tests, lib: Move globals onto sstables::test_env
  tests: Use sstables::test_env.db_config() to access config
  features: Mark feature_config_from_db_config const
  sstable_3_x_test: Use env method to create sst
  sstable_3_x_test: Indentation fix after previous patch
  sstable_3_x_test: Use sstable::test_env
  test: Add config to sstable::test_env creation
  config: Add constexpr value for default murmur ignore bits
2022-12-01 17:47:25 +02:00
Pavel Emelyanov
4c6bfc078d code: Use http::re(quest|ply) instead of httpd:: ones
Recent seastar update deprecated those from httpd namespace.

fixes: #12142

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12161
2022-12-01 17:33:35 +02:00
Pavel Emelyanov
adc6ee7ea8 test, utils: Use only one tempdir
There's a do_with_cloned_tmp_directory that makes two temp dirs to toss
sstables between them. Make it go with just one, all the more so it
would resemble existing manipulations aroung staging/ subdir

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:57 +03:00
Pavel Emelyanov
15a7b9cafa sstable_compaction_test: Dont create nested envs
The "compact" test case runs in sstables::test_env and additionally
wraps it with another instance provided by do_with_tmp_directory helper.
It's simpler to create the temp dir by hand and use outter env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:56 +03:00
Pavel Emelyanov
69fe5fd054 mutation_reader_test: Remove unused create_sstable() helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:54 +03:00
Pavel Emelyanov
400bc2c11d tests, lib: Move globals onto sstables::test_env
There's a bunch of objects that are used by test_env as sstables_manager
dependencies. Now when no other code needs those globals they better sit
on the test_env next to the manager

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:36 +03:00
Pavel Emelyanov
6a294b9ad6 tests: Use sstables::test_env.db_config() to access config
Currently some places use global test config, but it's going to be
removed soon, so switch to using config from environment

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:30 +03:00
Pavel Emelyanov
b4e31ad359 features: Mark feature_config_from_db_config const
It's in fact such. Other than that, next patch will call it with const
config at hand and fail to compile without this fix

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:27 +03:00
Pavel Emelyanov
8178845ef3 sstable_3_x_test: Use env method to create sst
Just to make it shorter and conform to other sst env tests

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:19 +03:00
Pavel Emelyanov
8d5d05012e sstable_3_x_test: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:39:09 +03:00
Pavel Emelyanov
6628d801f2 sstable_3_x_test: Use sstable::test_env
There are several cases there that construct sstables_manager by hand
with the help of a bunch of global dependencies. It's nicer to use
existing wrapper.

(indentation left broken until next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:38:46 +03:00
Pavel Emelyanov
1d8c76164f test: Add config to sstable::test_env creation
To make callers (tests) construct it with different options. In
particular, one test will soon want to construct it with custom large
data handler of its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:38:18 +03:00
Pavel Emelyanov
6d0c8fb6e2 config: Add constexpr value for default murmur ignore bits
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:38:15 +03:00
Botond Dénes
dbd00fd3e9 Merge 'Task manager shard repair tasks' from Aleksandra Martyniuk
The PR introduces shard_repair_task_impl which represents a repair task
that spans over a single shard repair.

repair_info is replaced with shard_repair_task_impl, since both serve
similar purpose.

Closes #12066

* github.com:scylladb/scylladb:
  repair: reindent
  repair: replace repair_info with shard_repair_task_impl
  repair: move repair_info methods to shard_repair_task_impl
  repair: rename methods of repair_module
  repair: change type of repair_module::_repairs
  repair: keep a reference to shard_repair_task_impl in row_level_repair
  repair: move repair_range method to shard_repair_task_impl
  repair: make do_repair_ranges a method of shard_repair_task_impl
  repair: copy repair_info methods to shard_repair_task_impl
  repair: corutinize shard task creation
  repair: define run for shard_repair_task_impl
  repair: add shard_repair_task_impl
2022-12-01 10:04:31 +02:00
Nadav Har'El
5eda8ce4fd alternator ttl: in scanning thread, don't retry the same page too many times
Since fixing issue #11737, when the expiration scanner times out reading
a page of data, it retries asking for the same page instead of giving up
on the scan and starting anew later. This retry was infinite - which can
cause problems if we have a bug in the code or several nodes down, which
can lead to getting hung in the same place in the scan for a very long
(potentially infinite) time without making any progress.

An example of such a bug was issue #12145, where we forgot to handle
shutdowns, so on shutdown of the cluster we just hung forever repeating
the same request that will never succeed. It's better in this case to
just give up on the current scan, and start it anew (from a random
position) later.

Refs #12145 (that issue was already fixed, by a different patch which
stops the iteration when shutting down - not waiting for an infinite
number of iterations and not even one more).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-11-30 18:42:37 +02:00
Nadav Har'El
d08eef5a30 alternator: fix hang during shutdown of expiration-scanning thread
The expiration-scanning thread is a long-running thread which can scan
data for hours, but checks for its abort-source before fetching each
page to allow for timely shutdown. Recently, we added the ability to
retry the page fetching in case of timeout, for forgot to check the
abort source in this new retry loop - which lead to an infinitely-long
shutdown in some tests while the retry loop retries forever.

In this patch we fix this bug by using sleep_abortable() instead of
sleep(). sleep_abortable() will throw an exception if the abort source
was triggered before or during the sleep - and this exception will
stop the scan immediately.

Fixes #12145

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-11-30 18:38:17 +02:00
Jan Ciolek
05ea0c1d60 dev/docs: add additional git pull to backport docs
Botond noted that an additional git pull
might be needed here:
https://github.com/scylladb/scylladb/pull/12138#discussion_r1035857007

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-30 16:14:02 +01:00
Jan Ciolek
e74873408b docs/dev: add a note about cherry-picking individual commits
Some people prefer to cherry-pick individual commits
so that they have less conflicts to resolve at once.

Add a comment about this possibility.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-30 16:06:39 +01:00
Kamil Braun
0f9d0dd86e Merge 'raft: support IP address change' from Konstantin Osipov
This is the core of dynamic IP address support in Raft, moving out the
IP address sourcing from Raft Group 0 configuration to gossip. At start
of Raft, the raft id <> IP address translation map is tuned into the
gossiper notifications and learns IP addresses of Raft hosts from them.

The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.

Closes #12035

* github.com:scylladb/scylladb:
  raft: (rpc) do not throw in case of a missing IP address in RPC
  raft: (address map) actively maintain ip <-> raft server id map
2022-11-30 15:40:18 +01:00
Aleksandra Martyniuk
78a6193c01 repair: reindent 2022-11-30 13:53:52 +01:00
Aleksandra Martyniuk
b4ad914fe1 repair: replace repair_info with shard_repair_task_impl
repair_info is deleted and all its attributes are moved to
shard_repair_task_impl.
2022-11-30 13:53:52 +01:00
Aleksandra Martyniuk
f6ec2cec92 repair: move repair_info methods to shard_repair_task_impl 2022-11-30 13:53:18 +01:00
Jan Ciolek
32663e6adb docs/dev: use 'is merged into' instead of 'becomes'
The backport instructions said that after passing
the tests next `becomes` master, but it's more
exact to say that next `is merged into` master.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-30 13:25:10 +01:00
Jan Ciolek
28cf8a18de docs/dev: mention that new backport instructions are for the contributor
Previously the section was called:
"How to backport a patch", which could be interpreted
as instructions for the maintainer.

The new title clearly states that these instructions
are for the contributor in case the maintainer couldn't
backport the patch by themselves.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-30 13:23:15 +01:00
Takuya ASADA
4ecc08c4fe docker: switch default locale to C.UTF-8
Since we switched scylla-machine-image locale to C.UTF-8 because
ubuntu-minimal image does not have en_US.UTF-8 by default, we should
do same on our docker image to reduce image size.

Verified #9570 does not occur on new image, since it is still UTF-8
locale.

Closes #12122
2022-11-30 13:58:43 +02:00
Anna Stuchlik
15cc3ecf64 doc: update the releases in the KB about updating the mode after upgrade 2022-11-30 12:53:13 +01:00
Anna Stuchlik
242a3916f0 doc: fix the broken link in the 5.1 upgrade guide 2022-11-30 12:49:20 +01:00
Alejo Sanchez
f7aa08ef25 test.py: don't stop cluster's site if not started
The site member is created in ScyllaCluster.start(), for startup failure
this might not be initialized, so check it's present before stop()ing
it. And delete it as it's not running and proper initialization should
call ScyllaCluster.start().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11939
2022-11-30 13:47:18 +02:00
Anna Stuchlik
1575d96856 doc: add the link to the 5.1-related KB article to the 5.1 upgrade guide 2022-11-30 12:40:49 +01:00
Nadav Har'El
ce347f4b67 test/cql-pytest: add test for meaning of fetch_size with filtering
A question was raised on what fetch_size (the requested page size
in a paged scan) counts when there is a filter: does it count the
rows before filtering (as scanned from disk) or after filter (as
will be returned to the client)?

This patch adds a test which demonstrates that Cassandra and Scylla
behave differently in this respect: Cassandra counts post-filtering -
so fetch_size results are actually returned, while Scylla currently
counts pre-filtering.

It is arguable which behavior is the "correct" one - we discuss this in
issue #12102. But we have already had several users (such as #11340)
who complained about Scylla's behavior and expected Cassandra's behavior,
so if we decide to keep Scylla's behavior we should at least explain and
justify this decision in our documentation. Until then, let's have this
test which reminds us of this incompatibility. This test currently passes
on Cassandra and fails (xfail) on Scylla.

Refs #11340
Refs #12102

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12103
2022-11-30 12:27:06 +02:00
Nadav Har'El
8bd8ef3d03 test/cql-pytest: add regression test for old issue
This patch adds a regression test for the old issue #65 which is about
a multi-column (tuple) clustering-column relation in a SELECT when one
these columns has reversed order. It turns out that we didn't notice,
but this issue was already solved - but we didn't have a regression test
for it. So this patch adds just a regression test. The test confirms that
Scylla now behaves like was desired when that issue was opened. The test
also passes on Cassandra, confirming that Scylla and Cassandra behave
the same for such requests.

Fixes #65

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12130
2022-11-30 12:22:21 +02:00
Michał Jadwiszczak
8e64e18b80 forward_service: add debug logs
Adds a few debug logs to see what is happening in https://github.com/scylladb/scylladb/issues/11684

Wrapped `forward_result::printer` into `seastar::value_of` to lazy
evaluate the printer

Closes #12113
2022-11-30 12:15:26 +02:00
Yaniv Kaul
b66ca3407a doc: Typo - then -> than
Fix a typo.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes #12140
2022-11-30 12:03:56 +02:00
Botond Dénes
50aea9884b Merge 'Improve the Raft upgrade procedure' from Kamil Braun
Better logging, less code, a minor fix.

Closes #12135

* github.com:scylladb/scylladb:
  service/raft: raft_group0: less repetitive logging calls
  service/raft: raft_group0: fix sleep_with_exponential_backoff
2022-11-30 11:24:20 +02:00
Avi Kivity
6a5d9ff261 treewide: use non-experimental std::source_location
Now that we use libstdc++ 12, we can use the standardized
source_location.

Closes #12137
2022-11-30 11:06:43 +02:00
Jan Ciolek
56a802c979 docs/dev: Add backport instructions for contributors
Add instructions on how to backport a feature
to on older version of Scylla.

It contains a detailed step-by-step instruction
so that people unfamiliar with intricacies
of Scylla's repository organization can
easily get the hang of it.

This is the guide I wish I had when I had
to do my first backport.

I put it in backport.md because that
looks like the file responsible
for this sort of information.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-29 22:10:27 +01:00
Konstantin Osipov
fbe7886cc0 raft: (rpc) do not throw in case of a missing IP address in RPC
Remove raft_address_map::get_inet_address()

While at it, coroutinize some rpc mehtods.

To propagate up the event of missing IP address, use coroutine::exception(
with a proper type (raft::transport_error) and a proper error message.

This is a building block from removing
raft_address_map::get_inet_address() which is too generic, and shifting
the responsibility of handling missing addresses to the address map
clients. E.g. one-way RPC shouldn't throw if an address is missing, but
just drop the message.

PS An attempt to use a single template function rendered to be too
complex:
- some functions require a gate, some don't
- some return void, some future<> and some future<raft::data_type>
2022-11-29 19:55:48 +03:00
Konstantin Osipov
73e5298273 raft: (address map) actively maintain ip <-> raft server id map
1) make address map API flexible

Before this patch:
- having a mapping without an actual IP address was an
  internal error
- not having a mapping for an IP address was an internal
  error
- re-mapping to a new IP address wasn't allowed

After this patch:

- the address map may contain a mapping
  without an actual IP address, and the caller must be prepared for it:
  find() will return a nullopt. This happens when we first add an entry
  to Raft configuration and only later learn its IP address, e.g.  via
  gossip.

- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications

Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.

3) prompt address map state with app state

Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).

The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.

Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.

Thanks to the changes above, Raft configuration no longer
stores IP addresses.

We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
2022-11-29 19:55:43 +03:00
Kamil Braun
3dbcff435f service/raft: raft_group0: less repetitive logging calls
Some log messages in retry loops in the Raft upgrade procedure included
a sentence like "sleeping before retrying..."; but not all of them.

With the recently added `sleep_with_exponential_backoff` abstraction we
can put this "sleeping..." message in a single place, and it's also easy
to say how long we're going to sleep.

I also enjoy using this `source_location` thing.
2022-11-29 17:42:43 +01:00
Nadav Har'El
c5121cf273 cql: fix column-name aliases in SELECT JSON
The SELECT JSON statement, just like SELECT, allows the user to rename
selected columns using an "AS" specification. E.g., "SELECT JSON v AS foo".
This specification was not honored: We simply forgot to look at the
alias in SELECT JSON's implementation (we did it correctly in regular
SELECT). So this patch fixes this bug.

We had two tests in cassandra_tests/validation/entities/json_test.py
that reproduced this bug. The checks in those tests now pass, but these
two tests still continue to fail after this patch because of two other
unrelated bugs that were discovered by the same tests. So in this patch
I also add a new test just for this specific issue - to serve as a
regression test.

Fixes #8078

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12123
2022-11-29 18:16:19 +02:00
Avi Kivity
faf11587fa Update seastar submodule
* seastar 4f4cc00660...3a5db04197 (16):
  > tls: add missing include <map>
  > Merge 'util/process: use then_unpack to help automatically unpack tuple.' from Jianyong Chen
  > HTTP: define formatter for status_type to fix build.
  > fsnotifier: move it into namespace experimental and add docs.
  > Move fsnotify.hh to the 'include' directory for public use.
  > Merge 'reactor: define make_pipe() and use make_pipe() in reactor::spawn()' from Kefu Chai
  > Merge 'Fix: error when compiling http_client_demo' from Amossss
  > util/process: using `data_sink_impl::put`
  > Merge 'dns: serialize UDP sends.' from Calle Wilund
  > build: use correct version when finding liburing
  > Merge 'Add simple http client' from Pavel Emelyanov
  > future: use invoke_result instead of nested requirements
  > Merge 'reactor: use separate calls in reactor and reactor_backend for read/write/sendmsg/recvmsg' from Kefu Chai
  > util, core: add spawn_process() helper
  > parallel utils: add note about shard-local parallelism
  > shared_mutex: return typed exceptional future in with_* error handlers

Closes #12131
2022-11-29 18:10:06 +02:00
Kamil Braun
580bdec875 service/raft: raft_group0: fix sleep_with_exponential_backoff
It was immediately jumping to _max_retry_period.
2022-11-29 16:27:59 +01:00
Nadav Har'El
6bc3075bbd test/alternator: increase timeout on TTL tests
Some of the tests in test/alternator/test_ttl.py need an expiration scan
pass to complete and expire items. In development builds on developer
machines, this usually takes less than a second (our scanning period is
set to half a second). However, in debug builds on Jenkins each scan
often takes up to 100 (!) seconds (this is the record we've seen so far).
This is why we set the tests' timeout to 120.

But recently we saw another test run failing. I think the problem is
that in some case, we need not one, but *two* scanning passes to
complete before the timeout: It is possible that the test writes an
item right after the current scan passed it, so it doesn't get expired,
and then we a second scan at a random position, possibly making that
item we mention one of the last items to be considered - so in total
we need to wait for two scanning periods, not one, for the item to
expire.

So this patch increases the timeout from 120 seconds to 240 seconds -
more than twice the highest scanning time we ever saw (100 seconds).

Note that this timeout is just a timeout, it's not the typical test
run time: The test can finish much more quickly, as little as one
second, if items expire quickly on a fast build and machine.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12106
2022-11-29 16:37:54 +03:00
Nadav Har'El
1f8adda4b2 Merge 'treewide: improve compatibility with gcc 12' from Avi Kivity
Fix some issues found with gcc 12. Note we can't fully compile with gcc yet, due to [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056

Closes #12121

* github.com:scylladb/scylladb:
  utils: observer: qualify seastar::noncopyable_function
  sstables: generation_type: forgo constexpr on hash of generation_type
  logalloc: disambiguate types and non-type members
  task_manager: disambiguate types and non-type members
  direct_failure_detector: don't change meaning of endpoint_liveness
  schema: abort on illegal per column computation kind
  database: abort on illegal per partition rate limit operation
  mutation_fragment: abort on illegal fragment type
  per_partition_rate_limit_options: abort on illegal operation type
  schema: drop unused lambda
  mutation_partition: drop unused lambda
  cql3: create_index_statement: remove unused lambda
  transport: prevent signed and unsigned comparison
  database: don't compare signed and unsigned types
  raft: don't compare signed and unsigned types
  compaction: don't compare signed and unsigned compaction counts
  bytes_ostream: don't take reference to packed variable
2022-11-29 13:57:24 +02:00
Avi Kivity
ea99750de7 test: give tests less-unique identifiers
Test identifiers are very unique, but this makes them less
useful in Jenkins Test Result Analyzer view. For example,
counter_test can be counter_test.432 in one run and counter_test.442
in another. Jenkins considers them different and so we don't see
a trend.

Limit the id uniqueness within a test case, so that we'll have
counter_test.{1, 2, 3} consistently. Those test will be grouped
together so we can see pass/fail trends.

Closes #11946
2022-11-29 13:14:14 +02:00
Yaniv Kaul
fef8e43163 doc: cluster management: Replace a misplaced period with a a bulleted list of items
Signed-Off-By: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes #12125
2022-11-29 12:42:24 +02:00
Botond Dénes
e9fec761a2 Merge 'doc: document the procedure for updating the mode after upgrade' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4126

Closes #11122

* github.com:scylladb/scylladb:
  doc: add info about the time-consuming step due to resharding
  doc: add the new KB to the toctree
  doc: doc: add a KB about updating the mode in perftune.yaml after upgrade
2022-11-29 12:41:46 +02:00
Avi Kivity
ea901fdb9d cql3: expr: fold null into untyped_constant/constant
Our `null` expression, after the prepare stage, is redundant with a
`constant` expression containing the value NULL.

Remove it. Its role in the unprepared stage is taken over by
untyped_constant, which gains a new type_class enumeration to
represent it.

Some subtleties:
 - Usually, handling of null and untyped_constant, or null and constant
   was the same, so they are just folded into each other
 - LWT "like" operator now has to discriminate between a literal
   string and a literal NULL
 - prepare and test_assignment were folded into the corresponing
   untyped_constant functions. Some care had to be taken to preserve
   error messages.

Closes #12118
2022-11-29 11:02:18 +02:00
Aleksandra Martyniuk
8bc0af9e34 repair: fix double start of data sync repair task
Currently, each data sync repair task is started (and hence run) twice.
Thus, when two running operations happen within a time frame long
enough, the following situation may occur:
- the first run finishes
- after some time (ttl) the task is unregistered from the task manager
- the second run finishes and attempts to finish the task which does
  not exist anymore
- memory access causes a segfault.

The second call to start is deleted. A check is added
to the start method to ensure that each task is started at most once.

Fixes: #12089

Closes #12090
2022-11-29 00:00:10 +02:00
Avi Kivity
9765b2e3bc cql3: expr: drop remnants of bool component from expression
In ad3d2ee47d, we replaced `bool` as an expression element
(representing a boolean constant) with `constant`. But a comment
and a concept continue to mention it.

Remove the comment and the concept fragment.

Closes #12119
2022-11-28 23:18:26 +02:00
Pavel Emelyanov
ae79669fd2 topology: Be less restrictive about missing endpoints
Recent changes in topology restricted the get_dc/get_rack calls. Older
code was trying to locate the endpoint in gossiper, then in system
keyspace cache and if the endpoint was not found in both -- returned
"default" location.

New code generates internal error in this case. This approach already
helped to spot several BUGs in code that had been eventually fixed, but
echoes of that change still pop up.

This patch relaxes the "missing endpoint" case by printing a warning in
logs and returning back the "default" location like old code did.

tests: update_cluster_layout_tests.py::*
       hintedhandoff_additional_test.py::TestHintedHandoff::test_hintedhandoff_rebalance
       bootstrap_test.py::TestBootstrap::test_decommissioned_wiped_node_can_join
       bootstrap_test.py::TestBootstrap::test_failed_bootstap_wiped_node_can_join
       materialized_views_test.py::TestMaterializedViews::test_decommission_node_during_mv_insert_4_nodes

refs: #11900
refs: #12054
fixes: #11870

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12067
2022-11-28 22:01:09 +02:00
Avi Kivity
3a6eafa8c6 utils: observer: qualify seastar::noncopyable_function
gcc checks name resolution eagerly, and can't find noncopyable_function
as this header doesn't include "seastarx.hh". Qualify the name
so it finds it.
2022-11-28 21:58:30 +02:00
Avi Kivity
5ae98ab3de sstables: generation_type: forgo constexpr on hash of generation_type
std::hash isn't constexpr, so gcc refuses to make hash of generation_type
constexpr. It's pointless anyway since we never have a compile-time
sstable generation.
2022-11-28 21:58:30 +02:00
Avi Kivity
a2d43bb851 logalloc: disambiguate types and non-type members
logalloc::tracker has some members with the same names as types from
namespace scope. gcc (rightfully) complains that this changes
the meaning of the name. Qualify the types to disambiguate.
2022-11-28 21:58:30 +02:00
Avi Kivity
ed5da87930 task_manager: disambiguate types and non-type members
task_manager has some members with the same names as types from
namespace scope. gcc (rightfully) complains that this changes
the meaning of the name. Qualify the types to disambiguate.
2022-11-28 21:58:30 +02:00
Avi Kivity
27be1670d1 direct_failure_detector: don't change meaning of endpoint_liveness
It's used both as a type and as a member. Qualify the type so they
have different names.
2022-11-28 21:58:30 +02:00
Avi Kivity
735c46cb63 schema: abort on illegal per column computation kind
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
2022-11-28 21:58:30 +02:00
Avi Kivity
f73a51250c database: abort on illegal per partition rate limit operation
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
2022-11-28 21:58:30 +02:00
Avi Kivity
f469885b41 mutation_fragment: abort on illegal fragment type
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
2022-11-28 21:58:30 +02:00
Avi Kivity
a3c89cedbd per_partition_rate_limit_options: abort on illegal operation type
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
2022-11-28 21:58:30 +02:00
Avi Kivity
7ec28a81bf schema: drop unused lambda
get_cell is defined but not used.
2022-11-28 21:58:30 +02:00
Avi Kivity
c493a2379a mutation_partition: drop unused lambda
should_purge_row_tombstone is defined but not used.
2022-11-28 21:58:30 +02:00
Avi Kivity
e25bf62871 cql3: create_index_statement: remove unused lambda
throw_exception is defined but not used.
2022-11-28 21:58:30 +02:00
Avi Kivity
5dedf85288 transport: prevent signed and unsigned comparison
This can lead to undefined behavior. Cast to unsigned, after
we've verified the value is indeed positive.
2022-11-28 21:58:30 +02:00
Avi Kivity
77be69b600 database: don't compare signed and unsigned types
gcc warns it can lead to undefined behavior, though 2G entries
in a list of mutations are unlikely. Use the correct type for iteration.
2022-11-28 21:58:30 +02:00
Avi Kivity
fb6804e7a4 raft: don't compare signed and unsigned types
gcc warns it can lead to undefined behavior, though 2G entries
in a list of mutations are unlikely. Use the correct type for iteration.
2022-11-28 21:58:30 +02:00
Avi Kivity
f565db75ce compaction: don't compare signed and unsigned compaction counts
gcc warns as this can lead to incorrect results. Cast the threshold
to an unsigned type (we know it's positive at this point) to avoid
the warning.
2022-11-28 21:41:56 +02:00
Avi Kivity
23b94ac391 bytes_ostream: don't take reference to packed variable
bytes_ostream is packed, so its _begin member is packed as well.
gcc (correctly) disallows taking a reference to an unaligned variable
in an aligned refernce, and complains.

Make it happy by open-coding the exchange operation.
2022-11-28 21:40:18 +02:00
Nadav Har'El
5480211061 Merge 'test.py: support node replace operation' from Kamil Braun
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.

If we want to reuse the IP address, we don't allocate one using the host
registry. This required certain refactors: moving the code responsible
for allocation of IPs outside `ScyllaServer`, into `ScyllaCluster`.

Add two tests, but they are now skipped: one of them is failing (unability
for new node to join group 0) and both suffer from a hardcoded 60-second sleep
in Scylla.

Closes #12032

* github.com:scylladb/scylladb:
  test/topology: simple node replace tests (currently disabled)
  test/pylib: scylla_cluster: support node replace operation
  test/pylib: scylla_cluster: move members initialization to constructor
  test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer
  test/pylib: scylla_cluster: refactor create_server parameters to a struct
  test.py: stop/uninstall clusters instead of servers when cleaning up
  test/pylib: artifact_registry: replace `Awaitable` type with `Coroutine`
  test.py: prepare for adding extra config from test when creating servers
  test/pylib: manager_client: convert `add_server` to use `put_json`
  test/pylib: rest_client: allow returning JSON data from `put_json`
  test/pylib: scylla_cluster: don't import from manager_client
2022-11-28 16:06:39 +02:00
Takuya ASADA
4d8fb569a1 install.sh: drop locale workaround from python3 thunk
Since #7408 does not occur on current python3 version (3.11.0), let's drop
the workarond.

Closes #12097
2022-11-28 13:07:03 +02:00
Anna Stuchlik
452915cef6 doc: set the documentation version 5.1 as default (latest)
Closes #12105
2022-11-28 12:02:13 +01:00
Avi Kivity
380da0586c Update tools/python3 submodule (drop locale workaround)
* tools/python3 773070e...548e860 (1):
  > install.sh: drop locale workaround from python3 thunk
2022-11-28 12:24:13 +02:00
Avi Kivity
0da66371a5 storage_proxy: coroutinize inner continuation of create_hint_sync_point()
It is part of a coroutine::parallel_for_each(), which is safe for lambda coroutines.

Closes #12057
2022-11-28 11:30:00 +02:00
Avi Kivity
d12d42d1a6 Revert "configure: temporarily disable wasm support for aarch64"
This reverts commit e2fe8559ca. I
ran all the release mode tests on aarch64 with it reverted, and
it passes. So it looks like whatever problems we had with it
were fixed.

Closes #12072
2022-11-28 11:30:00 +02:00
Nadav Har'El
99a72a9676 Merge 'cql3: expr: make it possible to evaluate expr::binary_operator' from Jan Ciołek
As a part of CQL rewrite we want to be able to perform filtering by calling `evaluate()` on an expression and checking if it evaluates to `true`. Currently trying to do that for a binary operator would result in an error.

Right now checking if a binary operation like `col1 = 123` is true is done using `is_satisfied_by`, which is able to check if a binary operation evaluates to true for a small set of predefined cases.

Eventually once the grammar is relaxed we will be able to write expressions like: `(col1 < col2) = (1 > ?)`, which doesn't fit with what `is_satisfied_by` is supposed to do.
Additionally expressions like `1 = NULL` should evaluate to `NULL`, not `true` or `false`. `is_satsified_by` is not able to express that properly.

The proper way to go is implementing `evaluate(binary_operator)`, which takes a binary operation and returns what the result of it would be.

Implementing `prepare_expression` for `binary_operator` requires us to be able to evaluate it first. In the next PR I will add support for `prepare_expression`.

Closes #12052

* github.com:scylladb/scylladb:
  cql-pytest: enable two unset value tests that pass now
  cql-pytest: reduce unset value error message
  cql3: expr: change unset value error messages to lowercase
  cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
  cql3: expr: remove needless braces around switch cases
  cql3: move evaluation IS_NOT NULL to a separate function
  expr_test: test evaluating LIKE binary_operator
  expr_test: test evaluating IS_NOT binary_operator
  expr_test: test evaluating CONTAINS_KEY binary_operator
  expr_test: test evaluating CONTAINS binary_operator
  expr_test: test evaluating IN binary_operator
  expr_test: test evaluating GTE binary_operator
  expr_test: test evaluating GT binary_operator
  expr_test: test evaluating LTE binary_operator
  expr_test: test evaluating LT binary_operator
  expr_test: test evaluating NEQ binary_operator
  expr_test: test evaluating EQ binary_operator
  cql3: expr properly handle null in is_one_of()
  cql3: expr properly handle null in like()
  cql3: expr properly handle null in contains_key()
  cql3: expr properly handle null in contains()
  cql3: expr: properly handle null in limits()
  cql3: expr: remove unneeded overload of limits()
  cql3: expr: properly handle null in equality operators
  cql3: expr: remove unneeded overload of equal()
  cql3: expr: use evaluate(binary_operator) in is_satisfied_by
  cql3: expr: handle IS NOT NULL when evaluating binary_operator
  cql3: expr: make it possible to evaluate binary_operator
  cql3: expr: accept expression as lhs argument to like()
  cql3: expr: accept expression as lhs in contains_key
  cql3: expr: accept expression as lhs argument to contains()
2022-11-28 11:30:00 +02:00
Nadav Har'El
1e59c3f9ef alternator: if TTL scan times out, continue immediately
The Alternator TTL expiration scanner scans an entire table using many
small pages. If any of those pages time out for some reason (e.g., an
overload situation), we currently consider the entire scan to have failed
and wait for the next scan period (which by default is 24 hours) when
we start the scan from scratch (at a random position). There is a risk
that if these timeouts are common enough to occur once or more per
scan, the result is that we double or more the effective expiration lag.

A better solution, done in this patch, is to retry from the same position
if a single page timed out - immediately (or almost immediately, we add
a one-second sleep).

Fixes #11737

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12092
2022-11-28 11:30:00 +02:00
Avi Kivity
45a57bf22d Update tools/java submodule (revert scylla-driver)
scylla-driver causes dtests to fail randomly (likely
due to incorrect handling of the USE statement). Revert
it.

* tools/java 73422ee114...1c06006447 (2):
  > Revert "Add Scylla Cloud serverless support"
  > Revert "Switch cqlsh to use scylla-driver"
2022-11-28 11:29:08 +02:00
Benny Halevy
8f584a9a80 storage_service: handle_state_normal: always update_topology before update_normal_tokens
update_normal_tokens checks that that the endpoint is in topology.
Currently we call update_topology on this path only if it's
not a normal_token_owner, but there are paths when the
endpoint could be a normal token owner but still
be pending in topology so always update it, just in case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-28 11:25:36 +02:00
Benny Halevy
6b13fd108a storage_service: handle_state_normal: delete outdated comment regarding update pending ranges race
asias@scylladb.com said:
> This comments was moved up to the wrong place when tmptr->update_topology was added.
> There is no race now since we use the copy-update-replace method to update token_metadada.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-28 11:25:36 +02:00
Kefu Chai
af011aaba1 utils/variant_element: simplify is_variant_element with right fold
for better readability than the recursive approach.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #12091
2022-11-27 16:34:34 +02:00
Avi Kivity
78222ea171 Update tools/java submodule (cqlsh system_distributed_everywhere is a system keyspace)
* tools/java 874e2d529b...73422ee114 (1):
  > Mark "system_distributed_everywhere" as system ks
2022-11-27 15:37:57 +02:00
Aleksandra Martyniuk
9a3d114349 tasks: move methods from task_manager to source file
Methods from tasks::task_manager and nested classes are moved
to source file.

Closes #12064
2022-11-27 15:09:28 +02:00
Piotr Dulikowski
22fbf2567c utils/abi: don't use the deprecated std::unexpected_handler
Recently, clang started complaining about std::unexpected_handler being
deprecated:

```
In file included from utils/exceptions.cc:18:
./utils/abi/eh_ia64.hh:26:10: warning: 'unexpected_handler' is deprecated [-Wdeprecated-declarations]
    std::unexpected_handler unexpectedHandler;
         ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/exception:84:18: note: 'unexpected_handler' has been explicitly marked deprecated here
  typedef void (*_GLIBCXX11_DEPRECATED unexpected_handler) ();
                 ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2343:32: note: expanded from macro '_GLIBCXX11_DEPRECATED'
                               ^
/usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/x86_64-redhat-linux/bits/c++config.h:2334:46: note: expanded from macro '_GLIBCXX_DEPRECATED'
                                             ^
1 warning generated.
```

According to cppreference.com, it was deprecated in C++11 and removed in
C++17 (!).

This commit gets rid of the warning by inlining the
std::unexpected_handler typedef, which is defined as a pointer a
function with 0 arguments, returning void.

Fixes: #12022

Closes #12074
2022-11-27 12:25:20 +02:00
Alejo Sanchez
5ff4b8b5f8 pytest: catch rare exception for random tables test
On rare occassions a SELECT on a DROPpped table throws
cassandra.ReadFailure instead of cassandra.InvalidRequest. This could
not be reproduced locally.

Catch both exceptions as the table is not present anyway and it's
correctly marked as a failure.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12027
2022-11-27 10:26:55 +02:00
Michał Chojnowski
a75e4e1b23 db: config: disable global index page caching by default
Global index page caching, as introduced in 4.6
(078a6e422b and 9f957f1cf9) has proven to be misdesigned,
because it poses a risk of catastrophic performance regressions in
common workloads by flooding the cache with useless index entries.
Because of that risk, it should be disabled by default.

Refs #11202
Fixes #11889

Closes #11890
2022-11-26 14:27:26 +02:00
Aleksandra Martyniuk
c2ea3f49e6 repair: rename methods of repair_module
Methods of repair_module connected with repair_module::_repairs
are renamed to match repair_module::_repairs type.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
13dbd75ba8 repair: change type of repair_module::_repairs
As a preparation to replacing repair_info with shard_repair_task_impl,
type of _repairs in repair module is changed from
std::unordered_map<int, lw_shared_ptr<repair_info>> to
std::unordered_map<int, tasks::task_id>.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
55c01a1beb repair: keep a reference to shard_repair_task_impl in row_level_repair
As a part of replacing repair_info with shard_repair_task_impl,
instead of a reference to repair_info, row_level_repair keeps
a reference to shard_repair_task_impl.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
9b664570f0 repair: move repair_range method to shard_repair_task_impl 2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
3ac5ba7b28 repair: make do_repair_ranges a method of shard_repair_task_impl
Function do_repair_ranges is directly connected to shard repair tasks.
Turning it into shard_repair_task_impl method enables an access to tasks'
members with no additional intermediate layers.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
a09dfcdacd repair: copy repair_info methods to shard_repair_task_impl
Methods of repair_info are copied to shard_repair_task_impl. They are
not used yet, it's a preparation for replacing repair_info with
shard_repair_task_impl.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
a4b1bdb56c repair: corutinize shard task creation 2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
996c0f3476 repair: define run for shard_repair_task_impl
Operations performed as a part of shard repair are moved
to shard_repair_task_impl run method.
2022-11-25 16:41:02 +01:00
Aleksandra Martyniuk
ba9770ea02 repair: add shard_repair_task_impl
Create a task spanning over a repair performed on a given shard.
2022-11-25 16:40:49 +01:00
Anna Stuchlik
d5f676106e doc: remove the LWT page from the index of Enterprise features
Closes #12076
2022-11-24 21:59:05 +02:00
Aleksandra Martyniuk
dcc17037c7 repair: fix bad cast in tasks::task_id parsing
In system_keyspace::get_repair_history value of repair_uuid
is got from row as tasks::task_id.
tasks::task_id is represented by an abstract_type specific
for utils::UUID. Thus, since their typeids differ, bad_cast
is thrown.

repair_uuid is got from row as utils::UUID and then cast.
Since no longer needed, data_type_for<tasks::task_id> is deleted.

Fixes: #11966

Closes #12062
2022-11-24 19:37:44 +02:00
Jan Ciolek
77c7d8b8f6 cql-pytest: enable two unset value tests that pass now
While implementing evaluate(binary_operator)
missing checks for unset value were added
for comparisons in filtering code.

Because of that some tests for unset value
started passing.

There are still other tests for unset value
that are failing because Scylla doesn't
have all the checks that it should.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:07:17 +01:00
Jan Ciolek
5bc0bc6531 cql-pytest: reduce unset value error message
When unset value appears in an invalid place
both Cassandra and Scylla throw an error.

The tests were written with Cassandra
and thus the expected error messages were
exactly the same as produced by Cassandra.

Scylla produces different error messages,
but both databases return messages with
the text 'unset value'.

Reduce the expected message text
from the whole message to something
that contains 'unset value'.

It would be hard to mimic Cassandra's
error messages in Scylla. There is no
point in spending time on that.
Instead it's better to modify the tests
so that they are able to work with
both Cassandra and Scylla.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:04:07 +01:00
Jan Ciolek
08f40a116d cql3: expr: change unset value error messages to lowercase
The messages used to contain UNSET_VALUE
in capital letters, but the tests
expect messages with 'unset value'.

Change the message so that it can
match the expected error text in tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-24 17:02:44 +01:00
Kamil Braun
fda6403b29 test/topology: simple node replace tests (currently disabled)
Add two node replace tests using the freshly added infrastructure.

One test replaces a node while using a different IP. It is disabled
because the replace operation has an unconditional 60-seconds sleep
(it doesn't depend on the ring_delay setting for some reason). The sleep
needs to be fixed before we can enable this test.

The other test replaces while reusing the replaced node's IP.
Additionally to the sleep, the test fails because the node cannot join
group 0; it's stuck in an infinite loop of trying to join:
```
INFO  2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found no local group 0. Discovering...
INFO  2022-11-18 15:56:19,933 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader b7047f7e-03e6-4797-a723-24054201f91d
INFO  2022-11-18 15:56:19,934 [shard 0] raft_group0 - Server 8de951fd-a528-4a82-ac54-592ea269537f is starting group 0 with id 25d2b050-6751-11ed-b534-c3c40c275dd3
WARN  2022-11-18 15:56:20,935 [shard 0] raft_group0 - failed to modify config at peer b7047f7e-03e6-4797-a723-24054201f91d: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO  2022-11-18 15:56:21,937 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN  2022-11-18 15:56:22,937 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
INFO  2022-11-18 15:56:23,938 [shard 0] raft_group0 - server 8de951fd-a528-4a82-ac54-592ea269537f found group 0 with group id 25d2b050-6751-11ed-b534-c3c40c275dd3, leader ee0175ea-6159-4d4c-9d7c-95c934f8a408
WARN  2022-11-18 15:56:24,939 [shard 0] raft_group0 - failed to modify config at peer ee0175ea-6159-4d4c-9d7c-95c934f8a408: seastar::rpc::timeout_error (rpc call timed out). Retrying.
```
and so on.
2022-11-24 16:26:23 +01:00
Kamil Braun
2f60550ff3 test/pylib: scylla_cluster: support node replace operation
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.

If we want to reuse the IP address, we don't allocate one using the host
registry.

Since now multiple servers can have the same IP, introduce a
`leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing
the cluster - to make sure we don't `release_host` the same host twice.
2022-11-24 16:26:23 +01:00
Kamil Braun
d80247f912 test/pylib: scylla_cluster: move members initialization to constructor
Previously some members had to be initialized in `install` because
that's when we first knew the IP address.

Now we know the IP address during construction, which allows us to make
the code a bit shorter and simpler, and establish invariants: some
members (such as `self.config`) are now valid for the entire lifetime of
the server object.

`install()` is reduced to performing only side effects (creating
directories, writing config files), all calculation is done inside the
constructor.
2022-11-24 16:26:23 +01:00
Kamil Braun
3934eefd20 test/pylib: scylla_cluster: (re)lease IP addr outside ScyllaServer
`ScyllaServer`s were constructed without IP addresses. They leased an IP
address from `HostRegistry` and released them in `uninstall`.

This responsibility was now moved into `ScyllaCluster`, which leases an
IP address for a server before constructing it, and passes it to the
constructor. It releases the addresses of its serverswhen uninstalling
itself.

This will allow the cluster to reuse the IP address of an existing
server in that cluster when adding a new server which wants to replace
the existing one. Instead of leasing a new address, it will pass
the existing IP address to the new server's constructor.

The refactor is also nice in that it establishes an invariant for
`ScyllaServer`, simplifying reasoning about the class: now it has
an `ip_addr` field at all times.

`host_registry` was moved from `ScyllaServer` to `ScyllaCluster`.
2022-11-24 16:26:23 +01:00
Kamil Braun
9d5e1191da test/pylib: scylla_cluster: refactor create_server parameters to a struct
`ScyllaCluster` constructor takes a function `create_server` which
itself takes 3 parameters now. Soon it will take a 4th. The list of
parameters is repeated at the constructor definition and the call site
of the constructor, with many parameters it begins being tiresome.
Refactor the list of parameters to a `NamedTuple`.
2022-11-24 16:26:23 +01:00
Kamil Braun
d582666293 test.py: stop/uninstall clusters instead of servers when cleaning up
`self.artifacts` was calling `ScyllaServer.stop` and
`ScyllaServer.uninstall`. Now it calls `ScyllaCluster.stop` and
`ScyllaCluster.uninstall`, which underneath stops/uninstalls
servers in this cluster.

We must be a bit more careful now in case installing/starting a
server inside a cluster fails: there are no server cleanup artifacts,
and a server is added to cluster's `running` map only after
`install_and_start` finishes (until that happens,
`ScyllaCluster.stop/uninstall` won't catch this server).
So handle failures explicitly in `install_and_start`.

This commit does not logically change how the tests are running - every
started server belongs to some cluster, so it will be cleaned up
- but it's an important refactor.

It will allow us to move IP address (de)allocation code outside
`ScyllaServer`, into `ScyllaCluster`, which in turn will allow us to
implement node replace operation for the case where we want to reuse
the replaced node's IP.

Also, `ScyllaCluster.uninstall` was unused before this change, now it's
used.
2022-11-24 16:26:17 +01:00
Avi Kivity
29a4b662f8 Merge 'doc: document the Alternator TTL feature as GA' from Anna Stuchlik
Currently, TTL is listed as one of the experimental features: https://docs.scylladb.com/stable/alternator/compatibility.html#experimental-api-features

This PR moves the feature description from the Experimental Features section to a separate section.
I've also added some links and improved the formatting.

@tzach I've relied on your release notes for RC1.

Refs: https://github.com/scylladb/scylladb/issues/5060

Closes #11997

* github.com:scylladb/scylladb:
  Update docs/alternator/compatibility.md
  doc: update the link to Enabling Experimental Features
  doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph
  doc: update the links to the Enabling Experimental Features section
  doc: add the link to the Enabling Experimental Features section
  doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section
2022-11-24 17:22:05 +02:00
Nadav Har'El
2dedb5ea75 alternator: make Alternator TTL feature no longer "experimental"
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.

This patch removes this requirement and in essence GAs this feature.

Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.

This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.

Fixes #12037

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12049
2022-11-24 17:21:39 +02:00
Tzach Livyatan
e96d31d654 docs: Add Authentication and Authorization as a prerequisite for Auditing.
Closes #12058
2022-11-24 17:21:23 +02:00
Kamil Braun
df731a5b0c test/pylib: artifact_registry: replace Awaitable type with Coroutine
The `cleanup_before_exit` method of `ArtifactRegistry` calls `close()`
on artifacts. mypy complains that `Awaitable` has no such method. In
fact, the `artifact` objects that we pass to `ArtifactRegistry`
(obtained by calling `async def` functions) do have a `close()` method,
and they are a particular case of `Awaitable`s, but in general not
all `Awaitable`s have `close()`.

Replace `Awaitable` with one of its subtypes: `Coroutine`. `Coroutine`s
have a `close()` method, and `async def` functions return objects of
this type. mypy no longer complains.
2022-11-24 16:17:05 +01:00
Nadav Har'El
c6bb64ab0e Merge 'Fix LWT insert crash if clustering key is null' from Gusev Petr
[PR](https://github.com/scylladb/scylladb/pull/9314) fixed a similar issue with regular insert statements
but missed the LWT code path.

It's expected behaviour of
`modification_statement::create_clustering_ranges` to return an
empty range in this case, since `possible_lhs_values` it
uses explicitly returns `empty_value_set` if it evaluates `rhs`
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
`modification_statement::process_where_clause`. So the only
problem was `modification_statement::execute_with_condition`
was not expecting an empty `clustering_range` in case of
a null clustering key.

Also this patch contains a fix for the problem with wrong
column name in Scylla error messages. If `INSERT` or `DELETE`
statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.

The problem occurs if the query contains a column with the list type,
otherwise
`statement_restrictions::process_clustering_columns_restrictions`
checks that all the components of the key are specified.

Closes #12047

* github.com:scylladb/scylladb:
  cql: refactor, inline modification_statement::validate_primary_key_restrictions
  cql: DELETE with null value for IN parameter should be forbidden
  cql: add column name to the error message in case of null primary key component
  cql: batch statement, inserting a row with a null key column should be forbidden
  cql: wrong column name in error messages
  modification_statement: fix LWT insert crash if clustering key is null
2022-11-24 16:15:27 +02:00
Nadav Har'El
6e9f739f19 Merge 'doc: add the links to the per-partition rate limit extension ' from Anna Stuchlik
Release 5.1. introduced a new CQL extension that applies to the CREATE TABLE and ALTER TABLE statements. The ScyllaDB-specific extensions are described on a separate page, so the CREATE TABLE and ALTER TABLE should include links to that page and section.

Note: CQL extensions are described with Markdown, while the Data Definition page is RST. Currently, there's no way to link from an RST page to an MD subsection (using a section heading or anchor), so a URL is used as a temporary solution.

Related: https://github.com/scylladb/scylladb/pull/9810

Closes #12070

* github.com:scylladb/scylladb:
  doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list
  doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections
2022-11-24 16:03:30 +02:00
Anna Stuchlik
8049670772 doc: move the info about per-partition rate limit for the ALTER TABLE statemet from the paragraph to the list 2022-11-24 14:42:11 +01:00
Anna Stuchlik
57a58b17a8 doc: enable publishing the documentation for version 5.1
Closes #12059
2022-11-24 13:55:25 +02:00
Benny Halevy
243dc2efce hints: host_filter: check topology::has_endpoint if enabled_selectively
Don't call get_datacenter(ep) without checking
first has_endpoint(ep) since the former may abort
on internal error if the endpoint is not listed
in topology.

Refs #11870

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12054
2022-11-24 14:33:06 +03:00
Anna Stuchlik
f158d31e24 doc: add the links to the per-partition rate limit extention to the CREATE TABLE and ALTER TABLE sections 2022-11-24 11:26:33 +01:00
Petr Gusev
b95305ae2b cql: refactor, inline modification_statement::validate_primary_key_restrictions
The function didn't add much value, just forwarded to _restrictions.
Removed it and called _restrictions->validate_primary_key directly.
2022-11-23 21:56:12 +04:00
Petr Gusev
f9936bb0cb cql: DELETE with null value for IN parameter should be forbidden
If a DELETE statement contains an IN operator and the
parameter value for it is NULL, this should also trigger
an error. This is in line with how Cassandra
behaves in this case.
2022-11-23 21:39:23 +04:00
Petr Gusev
c123f94110 cql: add column name to the error message in case of null primary key component
It's more user-friendly and the error message
corresponds to what Cassandra provides in this case.
2022-11-23 21:39:23 +04:00
Petr Gusev
7730c4718e cql: batch statement, inserting a row with a null key column should be forbidden
Regular INSERT statements with null values for primary key
components are rejected by Scylla since #9286 and #9314.
Batch statements missed a similar check, this patch
fixes it.

Fixes: #12060
2022-11-23 21:39:23 +04:00
Petr Gusev
89a5397d7c cql: wrong column name in error messages
If INSERT or DELETE statement is missing a non-last element of
the primary key, the error message generated contains
an invalid column name.

The problem occurs if the query contains a column with the list type,
otherwise
statement_restrictions::process_clustering_columns_restrictions
checks that all the components of the key are specified.

Fixes: #12046
2022-11-23 21:39:16 +04:00
Benny Halevy
996eac9569 topology: add get_datacenters
Returns an unordered set of datacenter names
to be used by network_topology_replication_strategy
and for ks_prop_defs.

The set is kept in sync with _dc_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12023
2022-11-23 18:39:36 +02:00
Takuya ASADA
9acdd3af23 dist: drop deprecated AMI parameters on setup scripts
Since we moved all IaaS code to scylla-machine-image, we nolonger need
AMI variable on sysconfig file or --ami parameter on setup scripts,
and also never used /etc/scylla/ami_disabled.
So let's drop all of them from Scylla core core.

Related with scylladb/scylla-machine-image#61

Closes #12043
2022-11-23 17:56:13 +02:00
Avi Kivity
7c66fdcad1 Merge 'Simplify sstable_directory configuration' from Pavel Emelyanov
When started the sstable_directory is constructed with a bunch of booleans that control the way its process_sstable_dir method works. It's shorter and simpler to pass these booleans into method directly, all the more so there's another flag that's already passed like this.

Closes #12005

* github.com:scylladb/scylladb:
  sstable_directory: Move all RAII booleans onto flags
  sstable_directory: Convert sort-sstables argument to flags struct
  sstable_directory: Drop default filter
2022-11-23 16:16:04 +02:00
Avi Kivity
70bfa708f5 storage_proxy: coroutinize change_hints_host_filter()
Trivial straight-line code, no performance implications.

Closes #12056
2022-11-23 15:34:24 +02:00
Jan Ciolek
84501851eb cql_pytest: ensure that where clauses like token(p) = 0 AND p = 0 are rejected
Scylla doesn't support combining restrictions
on token with other restrictions on partition key columns.

Some pieces of code depend on the assumption
that such combinations are allowed.
In case they were allowed in the future
these functions would silently start
returning wrong results, and we would
return invalid rows.

Add a test that will start failing once
this restriction is removed. It will
warn the developer to change the
functions that used to depend
on the assumption.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 13:09:22 +01:00
Botond Dénes
602dfdaf98 Merge 'Task manager top level repair tasks' from Aleksandra Martyniuk
The PR introduces top level repair tasks representing repair and node operations
performed with repair. The actions performed as a part of these operations are
moved to corresponding tasks' run methods.

Also a small change to repair module is added.

Closes #11869

* github.com:scylladb/scylladb:
  repair: define run for data_sync_repair_task_impl
  repair: add data_sync_repair_task_impl
  tasks: repair: add noexcept to task impl constructor
  repair: define run for user_requested_repair_task_impl
  repair: add user_requested_repair_task_impl
  repair: allow direct access to max_repair_memory_per_range
2022-11-23 14:02:30 +02:00
Jan Ciolek
338af848a8 cql3: expr: remove needless braces around switch cases
Originally put braces around the cases because
there were local variables that I didn't want
to be shadowed.

Now there are no variables so the braces
can be removed without any problems.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:30 +01:00
Jan Ciolek
e8a46d34c2 cql3: move evaluation IS_NOT NULL to a separate function
When evaluating a binary operation with
operations like EQUAL, LESS_THAN, IN
the logic of the operation is put
in a separate function to keep things clean.

IS_NOT NULL is the only exception,
it has its evaluate implementation
right in the evaluate(binary_operator)
function.

It would be cleaner to have it in
a separate dedicated function,
so it's moved to one.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:30 +01:00
Jan Ciolek
b6cf6e6777 expr_test: test evaluating LIKE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
6774272fd6 expr_test: test evaluating IS_NOT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
e6c78bb6c2 expr_test: test evaluating CONTAINS_KEY binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
4f250609ab expr_test: test evaluating CONTAINS binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:29 +01:00
Jan Ciolek
3ca04cfcc2 expr_test: test evaluating IN binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
41f452b73f expr_test: test evaluating GTE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
1fe9a9ce2a expr_test: test evaluating GT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
ef2a77a3e0 expr_test: test evaluating LTE binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:28 +01:00
Jan Ciolek
3cbb2d44e8 expr_test: test evaluating LT binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
9feee70710 expr_test: test evaluating NEQ binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
e77dba0b0b expr_test: test evaluating EQ binary_operator
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
63a89776a1 cql3: expr properly handle null in is_one_of()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:27 +01:00
Jan Ciolek
214dab9c77 cql3: expr properly handle null in like()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
2ce9c95a9d cql3: expr properly handle null in contains_key()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
336ad61aa3 cql3: expr properly handle null in contains()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
e2223be1ec cql3: expr: properly handle null in limits()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:26 +01:00
Jan Ciolek
d1abf2e168 cql3: expr: remove unneeded overload of limits()
There is a more general version of limits()
which takes expressions as both the lhs and rhs
arguments.

There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.

Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benfit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:25 +01:00
Jan Ciolek
0609a425e6 cql3: expr: properly handle null in equality operators
Expressions like:
123 = NULL
NULL = 123
NULL = NULL
NULL != 123

should be tolerated, but evaluate to NULL.
The current code assumes that a binary operator
can only evaluate to a boolean - true or false.

Now a binary operator can also evaluate to NULL.
This should happen in cases when one of the
operator's sides is NULL.

A special class is introduced to represent a value
that can be one of three things: true, false or null.
It's better than using std::optional<bool>,
because optional has implicit conversions to bool
that could cause confusion and bugs.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-23 12:44:22 +01:00
Aleksandra Martyniuk
a3016e652f repair: define run for data_sync_repair_task_impl
Operations performed as a part of data sync repair are moved
to data_sync_repair_task_impl run method.
2022-11-23 10:44:19 +01:00
Aleksandra Martyniuk
42239c8fed repair: add data_sync_repair_task_impl
Create a task spanning over whole node operation. Tasks of that type
are stored on shard 0.
2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk
9e108a2490 tasks: repair: add noexcept to task impl constructor
Add noexcept to constructor of tasks::task_manager::task::impl
and inheriting classes.
2022-11-23 10:19:53 +01:00
Aleksandra Martyniuk
4a4e9c12df repair: define run for user_requested_repair_task_impl
Operations performed as a part of user requested repair are
moved to user_requested_repair_task_impl run method.
2022-11-23 10:19:51 +01:00
Aleksandra Martyniuk
3800b771fc repair: add user_requested_repair_task_impl
Create a task spanning over whole user requested repair.
Tasks of that type are stored on shard 0.
2022-11-23 10:11:09 +01:00
Aleksandra Martyniuk
0256ede089 repair: allow direct access to max_repair_memory_per_range
Access specifier of constexpr value max_repair_memory_per_range
in repair_module is changed to public and its getter is deleted.
2022-11-23 10:11:09 +01:00
Anna Stuchlik
16e2b9acd4 Update docs/alternator/compatibility.md
Co-authored-by: Daniel Lohse <info@asapdesign.de>
2022-11-23 09:51:04 +01:00
Avi Kivity
d7310fd083 gdb: messaging: print tls servers too
Many systems have most traffic on tls servers, so print them.

Closes #12053
2022-11-23 07:59:02 +02:00
Avi Kivity
aec9faddb1 Merge 'storage_proxy: use erm topology' from Benny Halevy
When processing a query, we keep a pointer to an effective_replication_map.
In a couple places we used the latest topology instead of the one held by the effective_replication_map
that the query uses and that might lead to inconsistencies if, for example, a node is removed from topology after decommission that happens concurrently to the query.

This change gets the topology& from the e_r_m in those cases.

Fixes #12050

Closes #12051

* github.com:scylladb/scylladb:
  storage_proxy: pass topology& to sort_endpoints_by_proximity
  storage_proxy: pass topology& to is_worth_merging_for_range_query
2022-11-22 20:04:41 +02:00
Botond Dénes
49ec7caf27 mutation_fragment_stream_validator: avoid allocation when stream is correct
Currently the ctor of said class always allocates as it copies the
provided name string and it creates a new name via format().
We want to avoid this, now that the validator is used on the read path.
So defer creating the formatted name to when we actually want to log
something, which is either when log level is debug or when an error is
found. We don't care about performance in either case, but we do care
about it on the happy path.
Further to the above, provide a constructor for string literal names and
when this is used, don't copy the name string, just save a view to it.

Refs: #11174

Closes #12042
2022-11-22 19:19:18 +02:00
Nadav Har'El
ce7c1a6c52 Merge 'alternator: fix wrong 'where' condition for GSI range key' from Marcin Maliszkiewicz
Contains fixes requested in the issue (and some tiny extras), together with analysis why they don't affect the users (see commit messages).

Fixes [ #11800](https://github.com/scylladb/scylladb/issues/11800)

Closes #11926

* github.com:scylladb/scylladb:
  alternator: add maybe_quote to secondary indexes 'where' condition
  test/alternator: correct xfail reason for test_gsi_backfill_empty_string
  test/alternator: correct indentation in test_lsi_describe
  alternator: fix wrong 'where' condition for GSI range key
2022-11-22 17:46:52 +02:00
Pavel Emelyanov
22133a3949 sstable_directory: Move all RAII booleans onto flags
There's a bunch of booleans that control the behavior of sstable
directory scanning. Currently they are described as verbose
bool_class<>-es and are put into sstable_directory construction time.

However, these are not used outside of .process_sstable_dir() method and
moving them onto recently added flags struct makes the code much
shorter (29 insertions(+), 121 deletions(-))

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:30:00 +03:00
Pavel Emelyanov
7ca5e143d7 sstable_directory: Convert sort-sstables argument to flags struct
The sstable_directory::process_sstable_dir() accepts a boolean to
control its behavior when collecting sstables. Turn this boolean into a
structure of flags. The intention is to extend this flags set in the
future (next patch).

This boolean is true all the time, but one place sets it to true in a
"verbose" manner, like this:

        bool sort_sstables_according_to_owner = false;
        process_sstable_dir(directory, sort_sstables_according_to_owner).get();

the local variable is not used anymore. Using designated initializers
solves the verbosity in a nicer manner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:19:23 +03:00
Pavel Emelyanov
7c7017d726 sstable_directory: Drop default filter
It's used as default argument for .reshape() method, but callers specify
it explicitly. At the same time the filter is simple enough and is only
used in one place so that the caller can just use explicit lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-22 18:19:23 +03:00
Jan Ciolek
6be142e3a0 cql3: expr: remove unneeded overload of equal()
There is a more general version of equal()
which takes expressions as both the lhs and rhs
arguments.

There is no need for a specialized overload.
This specialized overload takes a tuple_constructor
as lhs, but we call evaluate() on both sides
of a binary operator before checking equality,
so this won't be useful at all.

Having multiple functions increases the risk
that one of them has a bug, while giving
dubious benfit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-22 14:28:10 +01:00
Benny Halevy
731a74c71f storage_proxy: pass topology& to sort_endpoints_by_proximity
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:02:40 +02:00
Benny Halevy
ab3fc1e069 storage_proxy: pass topology& to is_worth_merging_for_range_query
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:01:58 +02:00
Petr Gusev
0d443dfd16 modification_statement: fix LWT insert crash if clustering key is null
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.

It's expected behaviour of
modification_statement::create_clustering_ranges to return an
empty range in this case, since possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates rhs
to null, and it has a comment about it (All NULL
comparisons fail; no column values match.) On the other hand,
all components of the primary key are required to be set,
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was modification_statement::execute_with_condition
was not expecting an empty clustering_range in case of
a null clustering key.

Fixes: #11954
2022-11-22 16:45:16 +04:00
Marcin Maliszkiewicz
2bf2ffd3ed alternator: add maybe_quote to secondary indexes 'where' condition
This bug doesn't affect anything, the reason is descibed in the commit:
'alternator: fix wrong 'where' condition for GSI range key'.

But it's theoretically correct to escape those key names and
the difference can be observed via CQL's describe table. Before
the patch 'where' condition is missing one double quote in variable
name making it mismatched with corresponding column name.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
4389baf0d9 test/alternator: correct xfail reason for test_gsi_backfill_empty_string
Previously cited issue is closed already.
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
59eca20af1 test/alternator: correct indentation in test_lsi_describe
Otherwise I think assert is not executed in a loop. And I am not sure why lsi variable can be bound
to anything. As I tested it was pointing to the last element in lsis...
2022-11-22 11:08:23 +01:00
Marcin Maliszkiewicz
d6d20134de alternator: fix wrong 'where' condition for GSI range key
This bug doesn't manifest in a visible way to the user.

Adding the index to an existing table via GlobalSecondaryIndexUpdates is not supported
so we don't need to consider what could happen for empty values of index range key.
After the index is added the only interesting value user can set is omitting
the value (null or empty are not allowed, see test_gsi_empty_value and
test_gsi_null_value).

In practice no matter of 'where' condition the underlaying materialized
view code is skipping row updates with missing keys as per this comment:
'If one of the key columns is missing, set has_new_row = false
meaning that after the update there will be no view row'.

Thats why the added test passes both before and after the patch.
But it's still usefull to include it to exercise those code paths.

Fixes #11800
2022-11-22 11:08:23 +01:00
Nadav Har'El
ff617c6950 cql-pytest: translate a few small Cassandra tests
This patch includes a translation of several additional small test files
from Cassandra's CQL unit test directory cql3/validation/operations.

All tests included here pass on both Cassandra and Scylla, so they did
not discover any new Scylla bugs, but can be useful in the future as
regression tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12045
2022-11-22 07:54:13 +02:00
Botond Dénes
f3eecb47f6 Merge 'Optimize cleanup compaction get ranges for invalidation' from Benny Halevy
Take advantage of the facts that both the owned ranges
and the initial non_owned_ranges (derived from the set of sstables)
are deoverlapped and sorted by start token to turn
the calculation of the final non_owned_ranges from
quadratic to linear.

Fixes #11922

Closes #11903

* github.com:scylladb/scylladb:
  dht: optimize subtract_ranges
  compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation
  compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys
2022-11-22 06:45:01 +02:00
Jan Ciolek
a1407ef576 cql3: expr: use evaluate(binary_operator) in is_satisfied_by
is_satisfied_by has to check if a binary_operator is satisfied
by some values. It used to be impossible to evaluate
a binary_operator, so is_satisfied had code to check
if its satisfied for a limited number of cases
occuring when filtering queries.

Now evaluate(binary_operator) has been implemented
and is_satisfied_by can use it to check if a binary_operator
evaluates to true.
This is cleaner and reduces code duplication.
Additionally cql tests will test the new evalute() implementation.

There is one special case with token().
When is_satisfied_by sees a restriction on token
it assumes that it's satisfied because it's
sure that these token restrictions were used
to generate partition ranges.

I had to leave this special case in because it's impossible
to evaluate(token). Once this is implemented I will remove
the special case because it's risky and prone to cause
bugs.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 20:40:06 +01:00
Jan Ciolek
9c4889ecc3 cql3: expr: handle IS NOT NULL when evaluating binary_operator
The code to evaluate binary operators
was copied from is_satisfied_by.
is_satisfied_by wasn't able to evaluate
IS NOT NULL restrictions, so when such restriction
is encountered it throws an exception.

Implement proper handling for IS NOT NULL binary operators.

The switch ensures that all variants of oper_t are handled,
otherwise there would be a compilation error.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 20:40:00 +01:00
Avi Kivity
bf2e54ff85 Merge 'Move deletion log code to sstable_directory.cc' from Pavel Emelyanov
In order to support different storage kinds for sstable files (e.g. -- s3) it's needed to localize all the places that manipulate files on a POSIX filesystem so that custom storage could implement them in its own way. This set moves the deletion log manipulations to the sstable_directory.cc, which already "knows" that it works over a directory.

Closes #12020

* github.com:scylladb/scylladb:
  sstables: Delete log file in replay_pending_delete_log()
  sstables: Move deletion log manipulations to sstable_directory.cc
  sstables: Open-code delete_sstables() call
  sstables: Use fs::path in replay_pending_delete_log()
  sstables: Indentation fix after previous patch
  sstables: Coroutinize replay_pending_delete_log
  sstables: Read pending delete log with one line helper
  sstables: Dont write pending log with file_writer
2022-11-21 21:22:59 +02:00
Jan Ciolek
b4cc92216b cql3: expr: make it possible to evaluate binary_operator
evaluate() takes an expression and evaluates it
to a constant value. It wasn't possible to evalute
binary operators before, so it's added.

The code is based on is_satisfied_by,
which is currently used to check
whether a binary operator evaluates
to true or false.

It looks like is_satisfied_by and evalate()
do pretty much the same thing, one could be
implemented using the other.
In the future they might get merged
into a single function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 17:48:23 +01:00
Jan Ciolek
8d81eaa68f cql3: expr: accept expression as lhs argument to like()
like() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:33:18 +01:00
Jan Ciolek
b1a12686dc cql3: expr: accept expression as lhs in contains_key
contains_key() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:33:02 +01:00
Jan Ciolek
79cd9cd956 cql3: expr: accept expression as lhs argument to contains()
contains() used to only accept column_value as the lhs
to evaluate. Changed it to accept any generic expression.
This will allow to evaluate a more diverse set of
binary operators.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-21 16:32:44 +01:00
Benny Halevy
57ff3f240f dht: optimize subtract_ranges
Take advantage of the fact that both ranges and
ranges_to_subtract are deoverlapped and sorted by
to reduce the calculation complexity from
quadratic to linear.

Fixes #11922

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:48:28 +02:00
Benny Halevy
8b81635d95 compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation
The algorithm is generic and can be used elsewhere.

Add a unit test for the function before it gets
optimized in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:48:26 +02:00
Benny Halevy
7c6f60ae72 compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys
Currently, the function is inefficient in two ways:
1. unnecessary copy of first/last keys to automatic variables
2. redecorating the partition keys with the schema passed to
   needs_cleanup.

We canjust use the tokens from the sstable first/last decorated keys.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:44:32 +02:00
Pavel Emelyanov
2f9b7931af sstables: Delete log file in replay_pending_delete_log()
It's natural that the replayer cleans up after itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:16:22 +03:00
Pavel Emelyanov
bdc47b7717 sstables: Move deletion log manipulations to sstable_directory.cc
The deletion log concept uses the fact that files are on a POSIX
filesystem. Support for another storage type will have to reimplement
this place, so keep the FS-specific code in _directory.cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:16:21 +03:00
Pavel Emelyanov
865c51c6cf sstables: Open-code delete_sstables() call
It's no used by any other code, but to be used it requires the caller to
tranform TOC file names by prepending sstable directory to them. Things
get shorter and simpler if merging the helper code into the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
a61c96a627 sstables: Use fs::path in replay_pending_delete_log()
It's called by a code that has fs::path at hand and internally uses
helpers that need fs::path too, so no need to convert it back and forth.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
f5684bcaf0 sstables: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
85a73ca9c6 sstables: Coroutinize replay_pending_delete_log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
6f3fd94162 sstables: Read pending delete log with one line helper
There's one in seastar since recently

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:25 +03:00
Pavel Emelyanov
2dedf4d03a sstables: Dont write pending log with file_writer
It's a wrapper over output_stream with offset tracking and the tracking
is not needed to generate a log file. As a bonus of switching back we
get a stream.write(sstring) sugar.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:15:24 +03:00
Botond Dénes
2d4439a739 Merge 'doc: add a troubleshooting article about the missing configuration files' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11598

This PR adds the troubleshooting article submitted by @syuu1228 in the deprecated _scylla-docs_ repo, with https://github.com/scylladb/scylla-docs/pull/4152.
I copied and reorganized the content and rewritten it a little according to the RST guidelines so that the page renders correctly.

@syuu1228 Could you review this PR to make sure that my changes didn't distort the original meaning?

Closes #11626

* github.com:scylladb/scylladb:
  doc: apply the feedback to improve clarity
  doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB
  doc: add the new page to the toctree
  doc: add a troubleshooting article about the missing configuration files
2022-11-21 12:02:31 +02:00
Kamil Braun
135eb4a041 test.py: prepare for adding extra config from test when creating servers
We will use this for replace operations to pass the IP of replaced node.
2022-11-21 10:57:03 +01:00
Kamil Braun
ac91e9d8be test/pylib: manager_client: convert add_server to use put_json
We shall soon pass some JSON data into these requests.
2022-11-21 10:57:03 +01:00
Kamil Braun
82eb9af80d test/pylib: rest_client: allow returning JSON data from put_json
We'll use `put_json` for requests which want to pass JSON data into the
call and also return JSON.
2022-11-21 10:57:03 +01:00
Kamil Braun
4fef2d099b test/pylib: scylla_cluster: don't import from manager_client
There's a logical dependency from `manager_client` to `scylla_cluster`
(`ManagerClient` defined in `manager_client` talks to
`ScyllaClusterManager` defined in `scylla_cluster` over RPC). There is
no such dependency in the other way. Do not introduce it accidentally.

We can import these types from the `internal_types` module.
2022-11-21 10:57:03 +01:00
Nadav Har'El
757d2a4c02 test/alternator: un-xfail a test which passes on modern Python
We had an xfailing test that reproduced a case where Alternator tried
to report an error when the request was too long, but the boto library
didn't see this error and threw a "Broken Pipe" error instead. It turns
out that this wasn't a Scylla bug but rather a bug in urllib3, which
overzealously reported a "Broken Pipe" instead of trying to read the
server's response. It turns out this issue was already fixed in
   https://github.com/urllib3/urllib3/pull/1524

and now, on modern installations, the test that used to fail now passes
and reports "XPASS".

So in this patch we remove the "xfail" tag, and skip the test if
running an old version of urllib3.

Fixes #8195

Closes #12038
2022-11-21 08:10:10 +02:00
Botond Dénes
ffc3697f2f Merge 'storage_service api: handle dropped tables' from Benny Halevy
Gracefully skip tables that were removed in the background.

Fixes #12007

Closes #12013

* github.com:scylladb/scylladb:
  api: storage_service: fixup indentation
  api: storage_service: add run_on_existing_tables
  api: storage_service: add parse_table_infos
  api: storage_service: log errors from compaction related handlers
  api: storage_service: coroutinize compaction related handlers
2022-11-21 07:56:27 +02:00
Avi Kivity
994603171b Merge 'Add validator to the mutation compactor' from Botond Dénes
Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written.
This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too.
This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign.

Fixes: https://github.com/scylladb/scylladb/issues/11174

Closes #11532

* github.com:scylladb/scylladb:
  mutation_compactor: add validator
  mutation_fragment_stream_validator: add a 'none' validation level
  test/boost/mutation_query_test: test_partition_limit: sort input data
  querier: consume_page(): use partition_start as the sentinel value
  treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
  treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
  position_in_partition: add for_partition_{start,end}()
2022-11-20 20:33:26 +02:00
Avi Kivity
779b01106d Merge 'cql3: expr: add unit tests for prepare_expression' from Jan Ciołek
Adds unit tests for the function `expr::prepare_expression`.

Three minor bugs were found by these tests, both fixed in this PR.
1. When preparing a map, the type for tuple constructor was taken from an unprepared tuple, which has `nullptr` as its type.
2. Preparing an empty nonfrozen list or set resulted in `null`, but preparing a map didn't. Fixed this inconsistency.
3. Preparing a `bind_variable` with `nullptr` receiver was allowed. The `bind_variable` ended up with a `nullptr` type, which is incorrect. Changed it to throw an exception,

Closes #11941

* github.com:scylladb/scylladb:
  test preparing expr::usertype_constructor
  expr_test: test that prepare_expression checks style_type of collection_constructor
  expr_test: test preparing expr::collection_constructor for map
  prepare_expr: make preparing nonfrozen empty maps return null
  prepare_expr: fix a bug in map_prepare_expression
  expr_test: test preparing expr::collection_constructor for set
  expr_test: test preparing expr::collection_constructor for list
  expr_test: test preparing expr::tuple_constructor
  expr_test: test preparing expr::untyped_constant
  expr_test_utils: add make_bigint_raw/const
  expr_test_utils: add make_tinyint_raw/const
  expr_test: test preparing expr::bind_variable
  cql3: prepare_expr: forbid preparing bind_variable without a receiver
  expr_test: test preparing expr::null
  expr_test: test preparing expr::cast
  expr_test_utils: add make_receiver
  expr_test_utils: add make_smallint_raw/const
  expr_test: test preparing expr::token
  expr_test: test preparing expr::subscript
  expr_test: test preparing expr::column_value
  expr_test: test preparing expr::unresolved_identifier
  expr_test_utils: mock data_dictionary::database
2022-11-20 20:03:54 +02:00
Nadav Har'El
2ba8b8d625 test/cql-pytest: remove "xfail" from passing test testIndexOnFrozenCollectionOfUDT
We had a test that used to fail because of issue #8745. But this issue
was alread fixed, and we forgot to remove the "xfail" marker. The test
now passes, so let's remove the xfail marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12039
2022-11-20 19:54:59 +02:00
Avi Kivity
40f61db120 Merge 'docs: describe the Raft upgrade and recovery procedures' from Kamil Braun
Add new guide for upgrading 5.1 to 5.2.

In this new upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.

In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
  successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
  situation.

Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.

Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
  use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
  once they exist

Closes #11910

* github.com:scylladb/scylladb:
  docs: describe the Raft upgrade and recovery procedures
  docs: add upgrade guide 5.1 -> 5.2
2022-11-20 19:00:23 +02:00
Avi Kivity
15ee8cfc05 Merge 'reader_concurrency_semaphore: fix waiter/inactive race' from Botond Dénes
We recently (in 7fbad8de87) made sure all admission paths can trigger the eviction of inactive reads. As reader eviction happens in the background, a mechanism was added to make sure only a single eviction fiber was running at any given time. This mechanism however had a preemption point between stopping the fiber and releasing the evict lock. This gave an opportunity for either new waiters or inactive readers to be added, without the fiber acting on it. Since it still held onto the lock, it also prevented from other eviction fibers to start. This could create a situation where the semaphore could admit new reads by evicting inactive ones, but it still has waiters. Since an empty waitlist is also an admission criteria, once one waiter is wrongly added, many more can accumulate.
This series fixes this by ensuring the lock is released in the instant the fiber decides there is no more work to do.
It also fixes the assert failure on recursive eviction and adds a detection to the inactive/waiter contradiction.

Fixes: #11923
Refs: #11770

Closes #12026

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
  reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
  reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
2022-11-20 18:51:34 +02:00
Avi Kivity
895d721d5e Merge 'scylla-sstable: data-dump improvements' from Botond Dénes
This series contains a mixed bag of improvements to  `scylla sstable dump-data`. These improvements are mostly aimed at making the json output clearer, getting rid of any ambiguities.

Closes #12030

* github.com:scylladb/scylladb:
  tools/scylla-sstable: traverse sstables in argument order
  tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements
  tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>"
  tools/scylla-sstable: dump-data/json: use more uniform format for collections
  tools/scylla-sstable: dump-data/json: make cells easier to parse
2022-11-20 17:02:27 +02:00
Avi Kivity
2f9c53fbe4 Merge 'test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address' from Kamil Braun
Since recently the framework uses a separate set of unique IDs to
identify servers, but the log file and workdir is still named using the
last part of the IP address.

This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.

Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).

So use the ID instead to name the workdir and log file.

Also, when starting a test case, print the used cluster. This will make
it easier to map server IDs to their IP addresses when browsing through
the test logs.

Closes #12018

* github.com:scylladb/scylladb:
  test/pylib: manager_client: print used cluster when starting test case
  test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address
2022-11-20 16:56:19 +02:00
Avi Kivity
14218d82d6 Update tools/java submodule (serverless)
* tools/java caf754f243...874e2d529b (2):
  > Add Scylla Cloud serverless support
  > Switch cqlsh to use scylla-driver
2022-11-20 16:41:36 +02:00
Tomasz Grabiec
c8e983b4aa test: flat_mutation_reader_assertions: Use fatal BOOST_REQUIRE_EQUAL instead of BOOST_CHECK_EQUAL
BOOST_CHECK_EQUAL is a weaker form of assertion, it reports an error
and will cause the test case to fail but continues. This makes the
test harder to debug because there's no obvious way to catch the
failure in GDB and the test output is also flooded with things which
happen after the failed assertion.

Message-Id: <20221119171855.2240225-1-tgrabiec@scylladb.com>
2022-11-20 16:14:26 +02:00
Nadav Har'El
2d2034ea28 Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi column restrictions were detected, the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions are not satisfied. When they are satisfied the other restrictions are checked as well to ensure all of them are satisfied.

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column and regular columns and this approach was correct.

Fixes: #6200
Fixes: #12014

Closes #12031

* github.com:scylladb/scylladb:
  cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
  boost/restrictions-test: uncomment part of the test that passes now
  cql-pytest: enable test for filtering combined multi column and regular column restrictions
  cql3: don't ignore other restrictions when a multi column restriction is present during filtering
2022-11-20 11:50:38 +02:00
Benny Halevy
ec5707a4a8 api: storage_service: fixup indentation 2022-11-20 09:14:45 +02:00
Benny Halevy
cc63719782 api: storage_service: add run_on_existing_tables
Gracefully skip tables that were removed
in the background.

Fixes #12007

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:14:29 +02:00
Benny Halevy
9ef9b9d1d9 api: storage_service: add parse_table_infos
The table UUIDs are the same on all shards
so we might as well get them on shard 0
(as we already do) and reuse them on other shards.

It is more efficient and accurate to lookup the table
eventually on the shard using its uuid rather than
its name.  If the table was dropped and recreated
using the same name in the background, the new
table will have a new uuid and do the api function
does not apply to it anymore.

A following change will handle the no_such_column_family
cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:14:21 +02:00
Benny Halevy
9b4a9b2772 api: storage_service: log errors from compaction related handlers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:03:25 +02:00
Benny Halevy
a47f96bc05 api: storage_service: coroutinize compaction related handlers
Before we improve parsing tables lists
and handling of no_such_column_family
errors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-20 09:03:25 +02:00
Jan Ciolek
286f182a8c cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
In issue #12014 a user has encountered an instance of #6200.
When filtering a WHERE clause which contained
both multi-column and regular restrictions,
the regular restrictions were ignored.

Add a test which reproduces the issue
using a reproducer provided by the user.

This problem is tested in another similar test,
but this one reproduces the issue in the exact
way it was found by the user.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:27:42 +01:00
Jan Ciolek
63fb2612c3 boost/restrictions-test: uncomment part of the test that passes now
A part of the test was commented out due to #6200.
Now #6200 has been fixed and it can be uncommented.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:32 +01:00
Jan Ciolek
99e1032e34 cql-pytest: enable test for filtering combined multi column and regular column restrictions
The test test_multi_column_restrictions_and_filtering was marked as xfail,
because issue #6200 wasn't fixed. Now that filtering
multi column and other restrictions together has been fixed
the test passes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:32 +01:00
Jan Ciolek
b974d4adfb cql3: don't ignore other restrictions when a multi column restriction is present during filtering
When filtering with multi column restriction present all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`

would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi column restrictions were detected,
the code checked if they are satisfied and returned immediately.
This is fixed by returning only when these restrictions
are not satisfied. When they are satisfied the other
restrictions are checked as well to ensure all
of them are satisfied.

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi column
and regular columns and this approach was correct.

Fixes: #6200
Fixes: #12014

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-18 15:14:16 +01:00
Botond Dénes
30597f17ed tools/scylla-sstable: traverse sstables in argument order
In the order the user passed them on the command-line.
2022-11-18 15:58:37 +02:00
Botond Dénes
e337b25aa9 tools/scylla-sstable: dump-data docs: s/clustering_fragments/clustering_elements
The usage of clustering_fragments is a typo, the output contains clustering_elements.
2022-11-18 15:58:36 +02:00
Botond Dénes
c39408b394 tools/scylla-sstable: dump-data/json: use Null instead of "<unknown>"
The currently used "<unknown>" marker for invalid values/types is
undistinguishable from a normal value in some cases. Use the much more
distinct and unique json Null instead.
2022-11-18 15:58:36 +02:00
Botond Dénes
1dfceb5716 tools/scylla-sstable: dump-data/json: use more uniform format for collections
Instead of trying to be clever and switching the output on the type of
collection, use the same format always: a list of objects, where the
object has a key and value attribute, containing to the respective
collection item key and values. This makes processing much easier for
machines (and humans too since the previous system wasn't working well).
2022-11-18 15:58:36 +02:00
Botond Dénes
f89acc8df7 tools/scylla-sstable: dump-data/json: make cells easier to parse
There are several slightly different cell types in scylla: regular
cells, collection cells (frozen and non-frozen) and counter cells
(update and shards). In C++ code the type of the cell is always
available for code wishing to make out exactly what kind of cell a cell
is. In the JSON output of the dump-data this is currently really hard to
do as there is not enough information to disambiguate all the different
cell types. We wish to make the JSON output self-sufficient so in this
patch we introduce a "type" field which contains one of:
* regular
* counter-update
* counter-shards
* frozen-collection
* collection

Furthermore, we bring the different types closer by also printing the
counter shards under the 'value' key, not under the 'shards' key as
before. The separate 'shards' is no longer needed to disambiguate.
The documentation and the write operation is also updated to reflect the
changes.
2022-11-18 15:58:36 +02:00
Petr Gusev
41629e97de test.py: handle --markers parameter
Some tests may take longer than a few seconds to run. We want to
mark such tests in some way, so that we can run them selectively.
This patch proposes to use pytest markers for this. The markers
from the test.py command line are passed to pytest
as is via the -m parameter.

By default, the marker filter is not applied and all tests
will be run without exception. To exclude e.g. slow tests
you can write --markers 'not slow'.

The --markers parameter is currently only supported
by Python tests, other tests ignore it. We intend to
support this parameter for other types of tests in the future.

Another possible improvement is not to run suites for which
all tests have been filtered out by markers. The markers are
currently handled by pytest, which means that the logic in
test.py (e.g., running a scylla test cluster) will be run
for such suites.

Closes #11713
2022-11-18 12:36:20 +01:00
Avi Kivity
7da12c64bc Revert "Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity""
This reverts commit 22f13e7ca3, and reinstates
commit df8e1da8b2 ("Merge 'cql3: select_statement:
coroutinize indexed_table_select_statement::do_execute_base_query()' from
Avi Kivity"). The original commit was reverted due to failures in debug
mode on aarch64, but after commit 224a2877b9
("build: disable -Og in debug mode to avoid coroutine asan breakage"), it
works again.

Closes #12021
2022-11-18 12:44:00 +02:00
Kamil Braun
d7649a86c4 Merge 'Build up to support of dynamic IP address changes in Raft' from Konstantin Osipov
We plan to stop storing IP addresses in Raft configuration, and instead
use the information disseminated through gossip to locate Raft peers.

Implement patches that are building up to that:
* improve Raft API of configuration change notifications
* disseminate raft host id in Gossip
* avoid using Raft addresses from Raft configuraiton, and instead
  consistently use the translation layer between raft server id <-> IP
  address

Closes #11953

* github.com:scylladb/scylladb:
  raft: persist the initial raft address map
  raft: (upgrade) do not use IP addresses from Raft config
  raft: (and gossip) begin gossiping raft server ids
  raft: change the API of conf change notifications
2022-11-18 11:38:19 +01:00
Botond Dénes
437fcdeeda Merge 'Make use of enum_set in directory lister' from Pavel Emelyanov
The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent.

Closes #12017

* github.com:scylladb/scylladb:
  lister: Make lister::dir_entry_types an enum_set
  database: Avoid useless local variable
2022-11-18 12:15:26 +02:00
Botond Dénes
b39ca29b3c reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly
The semaphore should admit readers as soon as it can. So at any point in
time there should be either no waiters, or the semaphore shouldn't be
able to admit new reads. Otherwise something went wrong. Detect this
when queuing up reads and dump the diagnostics if detected.
Even though tests should ensure this should never happen, recently we've
seen a race between eviction and enqueuing producing such situations.
This is very hard to write tests for, so add built-in detection and
protection instead. Detecting this is very cheap anyway.
2022-11-18 11:35:47 +02:00
Botond Dénes
ca7014ddb8 reader_concurrency_semaphore: evict_readers_in_the_background(): eliminate blind spot
Said method has a protection against concurrent (recursive more like)
calls to itself, by setting a flag `_evicting` and returning early if
this flag is set. The evicting loop however has at least one preemption
point between deciding there is nothing more to evict and resetting said
flag. This window provides opporunity for new inactive reads or waiters
to be queued without this loop noticing, while denying any other
concurrent invocations at that time from reacting too.
Eliminate this by using repeat() instead of do_until() and setting
`_evicting = false` the moment the loop's run condition becomes false.
2022-11-18 11:35:47 +02:00
Botond Dénes
892f52c683 reader_concurrency_semaphore: do_detach_inactive_read(): do a complete detach
Currently this method detaches the inactive read from the handle and
notifies the permit, calls the notify handler if any and does some stat
bookkeeping. Extend it to do a complete detach: unlink the entry from
the inactive reads list and also cancel the ttl timer.
After this, all that is left to the caller is to destroy the entry.
This will prevent any recursive eviction from causing assertion failure.
Although recursive eviction shouldn't happen, it shouldn't trigger an
assert.
2022-11-18 11:35:43 +02:00
Pavel Emelyanov
a44ca06906 Merge 'token_metadata: Do not use topology info for is_member check' from Asias He
Since commit a980f94 (token_metadata: impl: keep the set of normal token owners as a member), we have a set, _normal_token_owners, which contains all the nodes in the ring.

We can use _normal_token_owners to check if a node is part of the ring directly instead of going through the _toplogy indirectly.

Fixes #11935

Closes #11936

* github.com:scylladb/scylladb:
  token_metadata: Rename is_member to is_normal_token_owner
  token_metadata: Add docs for is_member
  token_metadata: Do not use topology info for is_member check
  token_metadata: Check node is part of the topology instead of the ring
2022-11-18 11:54:07 +03:00
Asias He
4571fcf9e7 token_metadata: Rename is_member to is_normal_token_owner
The name is_normal_token_owner is more clear than is_member.
The is_normal_token_owner reflects what it really checks.
2022-11-18 09:29:20 +08:00
Asias He
965097cde5 token_metadata: Add docs for is_member
Make it clear, is_member checks if a node is part of the token ring and
checks nothing else.
2022-11-18 09:28:56 +08:00
Asias He
a495b71858 token_metadata: Do not use topology info for is_member check
Since commit a980f94 (token_metadata: impl: keep the set of normal token
owners as a member), we have a set, _normal_token_owners, which contains
all the nodes in the ring.

We can use _normal_token_owners to check if a node is part of the ring
directly instead of going through the _toplogy indirectly.

Fixes #11935
2022-11-18 09:28:56 +08:00
Asias He
f2ca790883 token_metadata: Check node is part of the topology instead of the ring
update_normal_tokens is the way to add a new node into the ring. We
should not require a new node to already be in the ring to be able to
add it to the ring. The current code works accidentally because
is_member is checking if a node is in the topology

We should use _topology.has_endpoint to check if a node is part of the
topology explicitly.
2022-11-18 09:28:56 +08:00
Jan Ciolek
77d68153f1 test preparing expr::usertype_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:10 +01:00
Jan Ciolek
eb92fb4289 expr_test: test that prepare_expression checks style_type of collection_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:10 +01:00
Jan Ciolek
77c63a6b92 expr_test: test preparing expr::collection_constructor for map
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:09 +01:00
Jan Ciolek
db67ade778 prepare_expr: make preparing nonfrozen empty maps return null
In Scylla and Cassandra inserting an empty collection
that is not frozen, is interpreted as inserting a null value.

list_prepare_expression and set_prepare_expression
have an if which handles this behavior, but there
wasn't one in map_prepare_expression.

As a result preparing empty list or set would result in null,
but preparing an empty map wouldn't. This is inconsistent,
it's better to return null in all cases of empty nonfrozen
collections.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:41:09 +01:00
Jan Ciolek
da71f9b50b prepare_expr: fix a bug in map_prepare_expression
map_prepare_expression takes a collection_constructor
of unprepared items and prepares it.

Elements of a map collection_constructor are tuples (key and value).

map_prepare_expression creates a prepared collection_constructor
by preparing each tuple and adding it to the result.

During this preparation it needs to set the type of the tuple.
There was a bug here - it took the type from unprepared
tuple_constructor and assigned it to the prepared one.
An unprepared tuple_constructor doesn't have a type
so it ended up assigning nullptr.

Instead of that it should create a tuple_type_impl instance
by looking at the types of map key and values,
and use this tuple_type_impl as the type of the prepared tuples.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:35:04 +01:00
Jan Ciolek
a656fdfe9a expr_test: test preparing expr::collection_constructor for set
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
76f587cfe7 expr_test: test preparing expr::collection_constructor for list
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
44b55e6caf expr_test: test preparing expr::tuple_constructor
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
265100a638 expr_test: test preparing expr::untyped_constant
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
f6b9100cd2 expr_test_utils: add make_bigint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:37 +01:00
Jan Ciolek
f9ff131f86 expr_test_utils: add make_tinyint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Jan Ciolek
76b6161386 expr_test: test preparing expr::bind_variable
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Jan Ciolek
4882724066 cql3: prepare_expr: forbid preparing bind_variable without a receiver
prepare_expression treats receiver as an optional argument,
it can be set to nullptr and the preparation should
still succeed when it's possible to infer the type of an expression.

preparing a bind_variable requires the receiver to be present,
because it doesn't contain any information about the type
of the bound value.

Added a check that the receiver is present.
Allowing to prepare a bind_variable without
the receiver present was a bug.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 20:22:36 +01:00
Avi Kivity
2779a171fc Merge 'Do not run aborted tasks' from Aleksandra Martyniuk
task_manager::task::impl contains an abort source which can
be used to check whether it is aborted and an abort method
which aborts the task (request_abort on abort_source) and all
its descendants recursively.

When the start method is called after the task was aborted,
then its state is set to failed and the task does not run.

Fixes: #11995

Closes #11996

* github.com:scylladb/scylladb:
  tasks: do not run tasks that are aborted
  tasks: delete unused variable
  tasks: add abort_source to task_manager::task::impl
2022-11-17 19:42:46 +02:00
Pavel Emelyanov
a396c27efc Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780

Closes #11942

* github.com:scylladb/scylladb:
  message: messaging_service: check for known topology before calling is_same_dc/rack
  test: reenable test_topology::test_decommission_node_add_column
  test/pylib: util: configurable period in wait_for
  message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
  message: messaging_service: topology independent connection settings for GOSSIP verbs
2022-11-17 20:14:32 +03:00
Jan Ciolek
42e01cc67f expr_test: test preparing expr::null
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:05 +01:00
Jan Ciolek
45b3fca71c expr_test: test preparing expr::cast
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:05 +01:00
Jan Ciolek
498c9bfa0d expr_test_utils: add make_receiver
Add a convenience function which creates receivers.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
6873a21fbd expr_test_utils: add make_smallint_raw/const
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
488056acb7 expr_test: test preparing expr::token
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
7958f77a40 expr_test: test preparing expr::subscript
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
569bd61c6c expr_test: test preparing expr::column_value
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
26174e29c6 expr_test: test preparing expr::unresolved_identifier
It's interesting that prepare_expression
for column identifiers doesn't require a receiver.
I hope this won't break validation in the future.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:04 +01:00
Jan Ciolek
c719a923bb expr_test_utils: mock data_dictionary::database
Add a function which creates a mock instance
of data_dictionary::database.

prepare_expression requires a data_dictionary::database
as an argument, so unit tests for it need something
to pass there. make_data_dictionary_database can
be used to create an instance that is sufficient for tests.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-11-17 17:30:00 +01:00
Kamil Braun
8e8c32befe test/pylib: manager_client: print used cluster when starting test case
It will be easier to map server IDs to their IP addresses when browsing
through the test logs.
2022-11-17 17:14:23 +01:00
Pavel Emelyanov
bc62ca46d4 lister: Make lister::dir_entry_types an enum_set
This type is currently an unordered_set, but only consists of at most
two elements. Making it an enum_set renders it into a size_t variable
and better describes the intention.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-17 19:01:45 +03:00
Pavel Emelyanov
c6021b57a1 database: Avoid useless local variable
It's used to run lister::scan_dir() with directory_entry_type::directory
only, but for that is copied around on lambda captures. It's simpler
just to use the value directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-17 19:00:49 +03:00
Kamil Braun
b83234d8aa test/pylib: scylla_cluster: use server ID to name workdir and log file, not IP address
Since recently the framework uses a separate set of unique IDs to
identify servers, but the log file and workdir is still named using the
last part of the IP address.

This is confusing: the test logs sometimes don't provide the IP addr
(only the ID), and even if they do, the reader of the test log may not
know that they need to look at the last part of the IP to find the
node's log/workdir.

Also using ID will be necessary if we want to reuse IP addresses (e.g.
during node replace, or simply not to run out of IP addresses during
testing).
2022-11-17 16:55:12 +01:00
Anna Stuchlik
f7f03e38ee doc: update the link to Enabling Experimental Features 2022-11-17 15:44:46 +01:00
Anna Stuchlik
02cea98f55 doc: remove the note referring to the previous ScyllaDB versions and add the relevant limitation to the paragraph 2022-11-17 15:05:00 +01:00
Anna Stuchlik
ce88c61785 doc: update the links to the Enabling Experimental Features section 2022-11-17 14:59:34 +01:00
Avi Kivity
76be6402ed Merge 'repair: harden effective replication map' from Benny Halevy
As described in #11993 per-shard repair_info instances get the effective_replication_map on their own with no centralized synchronization.

This series ensures that the effective replication maps used by repair (and other associated structures like the token metadata and topology) are all in sync with the one used to initiate the repair operation.

While at at, the series includes other cleanups in this area in repair and view that are not fixes as the calls happen in synchronous functions that do not yield.

Fixes #11993

Closes #11994

* github.com:scylladb/scylladb:
  repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
  repair: pass effective_replication_map down to repair_info
  repair: coroutinize sync_data_using_repair
  repair: futurize do_repair_start
  effective_replication_map: add global_effective_replication_map
  shared_token_metadata: get_lock is const
  repair: sync_data_using_repair: require to run on shard 0
  repair: require all node operations to be called on shard 0
  repair: repair_info: keep effective_replication_map
  repair: do_repair_start: use keyspace erm to get keyspace local ranges
  repair: do_repair_start: use keyspace erm for get_primary_ranges
  repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
  repair: do_repair_start: check_in_shutdown first
  repair: get_db().local() where needed
  repair: get topology from erm/token_metdata_ptr
  view: get_view_natural_endpoint: get topology from erm
2022-11-17 13:29:02 +02:00
Konstantin Osipov
262566216b raft: persist the initial raft address map 2022-11-17 14:26:36 +03:00
Konstantin Osipov
b35af73fdf raft: (upgrade) do not use IP addresses from Raft config
Always use raft address map to obtain the IP addresses
of upgrade peers. Right now the map is populated
from Raft configuration, so it's an equivalent transformation,
but in the future raft address map will be populated from other sources:
discovery and gossip, hence the logic of upgrade will change as well.

Do not proceed with the upgrade if an address is
missing from the map, since it means we failed to contact a raft member.
2022-11-17 14:26:31 +03:00
Pavel Emelyanov
2add9ba292 Merge 'Refactor topology out of token_metadata' from Benny Halevy
This series moves the topology code from locator/token_metadata.{cc,hh} out to localtor/topology.{cc,hh}
and introduces a shared header file: locator/types.hh contains shared, low level definitions, in anticipation of https://github.com/scylladb/scylladb/pull/11987

While at it, the token_metadata functions are turned into coroutines
and topology copy constructor is deleted.  The copy functionality is moved into an async `clone_gently` function that allows yielding while copying the topology.

Closes #12001

* github.com:scylladb/scylladb:
  locator: refactor topology out of token_metadata
  locator: add types.hh
  topology: delete copy constructor
  token_metadata: coroutinize clone functions
2022-11-17 13:55:34 +03:00
Aleksandra Martyniuk
7ead1a7857 compaction: request abort only once in compaction_data::stop
compaction_manager::task (and thus compaction_data) can be stopped
because of many different reasons. Thus, abort can be requested more
than once on compaction_data abort source causing a crash.

To prevent this before each request_abort() we check whether an abort
was requested before.

Closes #12004
2022-11-17 12:44:59 +02:00
Benny Halevy
1e2741d2fe abstract_replication_strategy: recognized_options: return unordered_set
An unordered_set is more efficient and there is no need
to return an ordered set for this purpose.

This change facilitates a follow-up change of adding
topology::get_datacenters(), returning an unordered_set
of datacenter names.

Refs #11987

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12003
2022-11-17 11:27:05 +02:00
Botond Dénes
e925c41f02 utils/gs/barrett.hh: aarch64: s/brarett/barrett/
Fix a typo introduced by the the recent patch fixing the spelling of
Barrett. The patch introduced a typo in the aarch64 version of the code,
which wasn't found by promotion, as that only builds on X86_64.

Closes #12006
2022-11-17 11:09:59 +02:00
Konstantin Osipov
051dceeaff raft: (and gossip) begin gossiping raft server ids
We plan to use gossip data to educate Raft RPC about IP addresses
of raft peers. Add raft server ids to application state, so
that when we get a notification about a gossip peer we can
identify which raft server id this notification is for,
specifically, we can find what IP address stands for this server
id, and, whenever the IP address changes, we can update Raft
address map with the new address.

On the same token, at boot time, we now have to start Gossip
before Raft, since Raft won't be able to send any messages
without gossip data about IP addresses.
2022-11-17 12:07:31 +03:00
Konstantin Osipov
990c7a209f raft: change the API of conf change notifications
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.

For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
2022-11-17 12:07:31 +03:00
Benny Halevy
53fdf75cf9 repair: pass erm down to get_hosts_participating_in_repair and get_neighbors
Now that it is available in repair_info.

Fixes #11993

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:30 +02:00
Benny Halevy
b69be61f41 repair: pass effective_replication_map down to repair_info
And make sure the token_metadata ring version is same as the
reference one (from the erm on shard 0), when starting the
repair on each shard.

Refs #11993

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:29 +02:00
Benny Halevy
c47d36b53d repair: coroutinize sync_data_using_repair
Prepare for the next path that will co_await
make_global_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:04 +02:00
Benny Halevy
58b1c17f5d repair: futurize do_repair_start
Turn it into a coroutine to prepare for the next path
that will co_await make_global_effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:04 +02:00
Benny Halevy
4b9269b7e2 effective_replication_map: add global_effective_replication_map
Class to hold a coherent view of a keyspace
effective replication map on all shards.

To be used in a following patch to pass the sharded
keyspace e_r_m:s to repair.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 08:07:01 +02:00
Avi Kivity
b8b78959fb build: switch to packaged libdeflate rather than a submodule
Now that our toolchain is based on Fedora 37, we can rely on its
libdeflate rather than have to carry our own in a submodule.

Frozen toolchain is regenerated. As a side effect clang is updated
from 15.0.0 to 15.0.4.

Closes #12000
2022-11-17 08:01:00 +02:00
Benny Halevy
2c677e294b shared_token_metadata: get_lock is const
The lock is acquired using an a function that
doesn't modify the shared_token_metadata object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
d6b2124903 repair: sync_data_using_repair: require to run on shard 0
And with that do_sync_data_using_repair can be folded into
sync_data_using_repair.

This will simplify using the effective_replication_map
throughout the operation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
0c56c75cf8 repair: require all node operations to be called on shard 0
To simplify using of the effective_replication_map / token_metadata_ptr
throught the operation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
64b0756adc repair: repair_info: keep effective_replication_map
Sampled when repair info is constructed.
To be used throughout the repair process.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
c7d753cd44 repair: do_repair_start: use keyspace erm to get keyspace local ranges
Rather than calling db.get_keyspace_local_ranges that
looks up the keyspace and its erm again.

We want all the inforamtion derived from the erm to
be based on the same source.

The function is synchronous so this changes doesn't
fix anything, just cleans up the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
aaf74776c2 repair: do_repair_start: use keyspace erm for get_primary_ranges
Ensure that the primary ranges are in sync with the
keyspace erm.

The function is synchronous so this change doesn't fix anything,
it just cleans up the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:58:21 +02:00
Benny Halevy
9200e6b005 repair: do_repair_start: use keyspace erm for get_primary_ranges_within_dc
Ensure the erm and topology are in sync.

The function is synchronous so this change doesn't fix
anything, just cleans up the code.

Fix mistake in comment while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:57:56 +02:00
Benny Halevy
59dc2567fd repair: do_repair_start: check_in_shutdown first
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
881eb0df83 repair: get_db().local() where needed
In several places we get the sharded database using get_db()
and then we only use db.local().  Simplify the code by keeping
reference only to the local database upfront.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
c22c4c8527 repair: get topology from erm/token_metdata_ptr
We want the topology to be synchronized with the respective
effective_replication_map / token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Benny Halevy
94f2e95a2f view: get_view_natural_endpoint: get topology from erm
Get the topology for the effective replication map rather
than from the storage_proxy to ensure its synchronized
with the natural endpoints.

Since there's no preemption between the two calls
currently there is no issue, so this is merely a clean up
of the code and not supposed to fix anything.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-17 07:56:34 +02:00
Nadav Har'El
e393639114 test/cql-pytest: reproducer for crash in LWT with null key
This patch adds a reproducer for issue #11954: Attempting an
"IF NOT EXISTS" (LWT) write with a null key crashes Scylla,
instead of producing a simple error message (like happens
without the "IF NOT EXISTS" after #7852 was fixed).

The test passed on Cassandra, but crashes Scylla. Because of this
crash, we can't just mark the test "xfail" and it's temporarily
marked "skip" instead.

Refs #11954.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11982
2022-11-17 07:31:13 +02:00
Benny Halevy
d0bd305d16 locator: refactor topology out of token_metadata
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 21:55:54 +02:00
Benny Halevy
297a4de4e4 locator: add types.hh
To export low-level types that are used by oher modules
for the locator interfaces.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 21:53:05 +02:00
Kamil Braun
0c9cb5c5bf Merge 'raft: wait for the next tick before retrying' from Gusev Petr
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in case of
`modify_config`, other `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could led to a tight loop.

The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.

Fixes: #11564

Closes #11769

* github.com:scylladb/scylladb:
  raft: rafactor: remove duplicate code on retries delays
  raft: use wait_for_next_tick in read_barrier
  raft: wait for the next tick before retrying
2022-11-16 18:20:54 +01:00
Aleksandra Martyniuk
4250bd9458 tasks: do not run tasks that are aborted
Currently in start() method a task is run even if it was already
aborted.

When start() is called on an aborted task, its state is set to
task_manager::task_state::failed and it doesn't run.
2022-11-16 18:09:41 +01:00
Aleksandra Martyniuk
ebffca7ea5 tasks: delete unused variable 2022-11-16 18:07:57 +01:00
Aleksandra Martyniuk
752edc2205 tasks: add abort_source to task_manager::task::impl
task_manager::task can be aborted with impl's abort_source.
By default abort request is propagated to all task's descendants.
2022-11-16 18:07:11 +01:00
Avi Kivity
c4f069c6fc Update seastar submodule
* seastar 153223a188...4f4cc00660 (10):
  > Merge 'Avoid using namespace internal' from Pavel Emelyanov
  > Merge 'De-futurize IO class update calls' from Pavel Emelyanov
  > abort_source: subscribe(): remove noexcept qualifier
  > Merge 'Add Prometheus filtering capabilities by label' from Amnon Heiman
  > fsqual: stop causing memory leak error on LeakSanitizer
  > metrics.cc: Do not merge empty histogram
  > Update tutorial.md
  > README-DPDK.md: document --cflags option
  > build: install liburing.pc using stow
  > core/polymorphic_temporary_buffer: include <seastar/core/memory.hh>

Closes #11991
2022-11-16 17:59:33 +02:00
Avi Kivity
3497891cf9 utils: spell "barrett" correctly
As P. T. Barnoom famously said, "write what you like but spell my name
correctly". Following that, we correct the spelling of Barrett's name
in the source tree.

Closes #11989
2022-11-16 16:30:38 +02:00
Benny Halevy
0c94ffcc85 topology: delete copy constructor
Topology is copied only from token_metadata_impl::clone_only_token_map
which copies the token_metadata_impl with yielding to prevent reactor
stalls.  This should apply to topology as well, so
add a clone_gently function for cloning the topology
from token_metadata_impl::clone_only_token_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 15:27:28 +02:00
Benny Halevy
4f4fc7fe22 token_metadata: coroutinize clone functions
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-16 15:27:28 +02:00
Kamil Braun
a83789160d message: messaging_service: check for known topology before calling is_same_dc/rack
`is_same_dc` and `is_same_rack` assume that the peer's topology is
known. If it's unknown, `on_internal_error` will be called inside
topology.

When these functions are used in `get_rpc_client`, they are already
protected by an earlier check for knowing the peer's topology
(the `has_topology()` lambda).

Another use is in `do_start_listen()`, where we create a filter for RPC
module to check if it should accept incoming connections. If cross-dc or
cross-rack encryption is enabled, we will reject connections attempts to
the regular (non-ssl) port from other dcs/rack using `is_same_dc/rack`.
However, it might happen that something (other Scylla node or otherwise)
tries to contact us on the regular port and we don't know that thing's
topology, which would result in `on_internal_error`. But this is not a
fatal error; we simply want to reject that connection. So protect these
calls as well.

Finally, there's `get_preferred_ip` with an unprotected `is_same_dc`
call which, for a given peer, may return a different IP from preferred IP
cache if the endpoint resides in the same DC. If there is not entry in
the preferred IP cache, we return the original (external) IP of the
peer. We can do the same if we don't know the peer's topology. It's
interesting that we didn't see this particular place blowing up. Perhaps
the preferred IP cache is always populated after we know the topology.
2022-11-16 14:01:50 +01:00
Kamil Braun
9b2449d3ea test: reenable test_topology::test_decommission_node_add_column
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.

Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
2022-11-16 14:01:50 +01:00
Kamil Braun
0f49813312 test/pylib: util: configurable period in wait_for 2022-11-16 14:01:50 +01:00
Kamil Braun
1bd2471c19 message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when topology was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780
2022-11-16 14:01:50 +01:00
Kamil Braun
840be34b5f message: messaging_service: topology independent connection settings for GOSSIP verbs
The gossip verbs are used to learn about topology of other nodes.
If inter-dc/rack encryption is enabled, the knowledge of topology is
necessary to decide whether it's safe to send unencrypted messages to
nodes (i.e., whether the destination lies in the same dc/rack).

The logic in `messaging_service::get_rpc_client`, which decided whether
a connection must be encrypted, was this (given that encryption is
enabled): if the topology of the peer is known, and the peer is in the
same dc/rack, don't encrypt. Otherwise encrypt.

However, it may happen that node A knows node B's topology, but B
doesn't know A's topology. A deduces that B is in the same DC and rack
and tries sending B an unencrypted message. As the code currently
stands, this would cause B to call `on_internal_error`. This is what I
encountered when attempting to fix #11780.

To guarantee that it's always possible to deliver gossiper verbs (even
if one or both sides don't know each other's topology), and to simplify
reasoning about the system in general, choose connection settings that
are independent of the topology - for the connection used by gossiper
verbs (other connections are still topology-dependent and use complex
logic to handle the situation of unknown-and-later-known topology).

This connection only contains 'rare' and 'cheap' verbs, so it's not a
performance problem to always encrypt it (given that encryption is
configured). And this is what already was happening in the past; it was
at some point removed during topology knowledge management refactors. We
just bring this logic back.

Fixes #11992.

Inspired by xemul/scylla@45d48f3d02.
2022-11-16 13:58:07 +01:00
Anna Stuchlik
01c9846bb6 doc: add the link to the Enabling Experimental Features section 2022-11-16 13:24:45 +01:00
Anna Stuchlik
f1b2f44aad doc: move the TTL Alternator feature from the Experimental Features section to the production-ready section 2022-11-16 13:23:07 +01:00
Nadav Har'El
2f2f01b045 materialized views: fix view writes after base table schema change
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.

This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.

This patch also includes a reproducing test that failed before this
patch, and passes afterwords. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.

Fixes #10026
Fixes #11542
2022-11-16 13:58:21 +02:00
Nadav Har'El
7cbb0b98bb Merge 'doc: document user defined functions (UDFs)' from Anna Stuchlik
This PR is V2 of the[ PR created by @psarna.](https://github.com/scylladb/scylladb/pull/11560).
I have:
- copied the content.
- applied the suggestions left by @nyh.
- made minor improvements, such as replacing "Scylla" with "ScyllaDB", fixing punctuation, and fixing the RST syntax.

Fixes https://github.com/scylladb/scylladb/issues/11378

Closes #11984

* github.com:scylladb/scylladb:
  doc: label user-defined functions as Experimental
  doc: restore the note for the Count function (removed by mistatke)
  doc: document user defined functions (UDFs)
2022-11-16 13:09:47 +02:00
Botond Dénes
cbf9be9715 Merge 'Avoid 0.0.0.0 (and :0) as preferred IP' from Pavel Emelyanov
Despite docs discourage from using INADDR_ANY as listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in.

Closes #11846

* github.com:scylladb/scylladb:
  messaging_service: Deny putting INADD_ANY as preferred ip
  messaging_service: Toss preferred ip cache management
  gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
  gossiping_property_file_snitch: Make _listen_address optional
2022-11-16 08:30:42 +02:00
Avi Kivity
43d3e91e56 tools: toolchain: prepare: use real bash associative array
When we translate from docker/go arch names to the kernel arch
names, we use an associative array hack using computed variable
names "{$!variable_name}". But it turns out bash has real
associative arrays, introduced with "declare -A". Use the to make
the code a little clearer.

Closes #11985
2022-11-16 08:17:47 +02:00
Botond Dénes
e90d0811d0 Merge 'doc: update ScyllaDB requirements - supported CPUs and AWS i4g instances' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4144

Closes #11226

* github.com:scylladb/scylladb:
  Update docs/getting-started/system-requirements.rst
  doc: specify the recommended AWS instance types
  doc: replace the tables with a generic description of support for Im4gn and Is4gen instances
  doc: add support for AWS i4g instances
  doc: extend the list of supported CPUs
2022-11-16 08:15:00 +02:00
Botond Dénes
bd1fcbc38f Merge 'Introduce reverse vector_deserializer.' from Michał Radwański
As indicated in #11816, we'd like to enable deserializing vectors in reverse.
The forward deserialization is achieved by reading from an input_stream. The
input stream internally is a singly linked list with complicated logic. In order to
allow for going through it in reverse, instead when creating the reverse vector
initializer, we scan the stream and store substreams to all the places that are a
starting point for a next element. The iterator itself just deserializes elements
from the remembered substreams, this time in reverse.

Fixes #11816

Closes #11956

* github.com:scylladb/scylladb:
  test/boost/serialization_test.cc: add test for reverse vector deserializer
  serializer_impl.hh: add reverse vector serializer
  serializer_impl: remove unneeded generic parameter
2022-11-16 07:37:24 +02:00
Anna Stuchlik
cdb6557f23 doc: label user-defined functions as Experimental 2022-11-15 21:22:01 +01:00
Avi Kivity
d85f731478 build: update toolchain to Fedora 37 with clang 15
'cargo' instantiation now overrides internal git client with
cli client due to unbounded memory usage [1].

[1] https://github.com/rust-lang/cargo/issues/10583#issuecomment-1129997984
2022-11-15 16:48:09 +00:00
Anna Stuchlik
1f1d88d04e doc: restore the note for the Count function (removed by mistatke) 2022-11-15 17:41:22 +01:00
Anna Stuchlik
dbb19f55fb doc: document user defined functions (UDFs) 2022-11-15 17:33:05 +01:00
Nadav Har'El
e4dba6a830 test/cql-pytest: add test for when MV requires IS NOT NULL
As noted in issue #11979, Scylla inconsistently (and unlike Cassandra)
requires "IS NOT NULL" one some but not all materialized-view key
columns. Specifically, Scylla does not require "IS NOT NULL" on the
base's partition key, while Cassandra does.

This patch is a test which demonstrates this inconsistency. It currently
passes on Cassandra and fails on Scylla, so is marked xfail.

Refs #11979

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11980
2022-11-15 14:21:48 +01:00
Asias He
16bd9ec8b1 gossip: Improve get_live_token_owners and get_unreachable_token_owners
The get_live_token_owners returns the nodes that are part of the ring
and live.

The get_unreachable_token_owners returns the nodes that are part of the ring
and is not alive.

The token_metadata::get_all_endpoints returns nodes that are part of the
ring.

The patch changes both functions to use the more authoritative source to
get the nodes that are part of the ring and call is_alive to check if
the node is up or down. So that the correctness does not depend on
any derived information.

This patch fixes a truncate issue in storage_proxy::truncate_blocking
where it calls get_live_token_owners and get_unreachable_token_owners to
decide the nodes to talk with for truncate operation. The truncate
failed because incorrect nodes were returned.

Fixes #10296
Fixes #11928

Closes #11952
2022-11-15 14:21:48 +01:00
Botond Dénes
21489c9f9c Merge 'doc: add the "Scylladb Enterprise" label to the Enterprise-only features' from Anna Stuchlik
This PR is a follow-up to https://github.com/scylladb/scylladb/pull/11918.

With this PR:
- The "ScyllaDB Enterprise" label is added to all the features that are only available in ScyllaDB Enterprise.
- The previous Enterprise-only note is removed (it was included in multiple files as _/rst_include/enterprise-only-note.rst_ - this file is removed as it is no longer used anywhere in the docs).
- "Scylla Enterprise" was removed from `versionadded `because now it's clear that the feature was added for Enterprise.

Closes #11975

* github.com:scylladb/scylladb:
  doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore
  doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features
2022-11-15 14:21:48 +01:00
Botond Dénes
34f29c8d67 Merge 'Use with_sstable_directory() helper in tests' from Pavel Emelyanov
The helper is already widely used, one (last) test case can benefit from using it too

Closes #11978

* github.com:scylladb/scylladb:
  test: Indentation fix after previous patch
  test: Wse with_sstable_directory() helper
2022-11-15 14:21:48 +01:00
Nadav Har'El
8a4ab87e44 Merge 'utils: crc: generate crc barrett fold tables at compile time' from Avi Kivity
We use Barrett tables (misspelled in the code unfortunately) to fold
crc computations of multiple buffers into a single crc. This is important
because it turns out to be faster to compute crc of three different buffers
in parallel rather than compute the crc of one large buffer, since the crc
instruction has latency 3.

Currently, we have a separate code generation step to compute the
fold tables. The step generates a new C++ source files with the tables.
But modern C++ allows us to do this computation at compile time, avoiding
the code generation step. This simplifies the build.

This series does that. There is some complication in that the code uses
compiler intrinsics for the computation, and these are not constexpr friendly.
So we first introduce constexpr-friendly alternatives and use them.

To prove the transformation is correct, I compared the generated code from
before the series and from just before the last step (where we use constexpr
evaluation but still retain the generated file) and saw no difference in the values.

Note that constexpr is not strictly needed - we could have run the code in the
global variables' initializer. But that would cause a crash if we run on a pre-clmul
machine, and is not as fun.

Closes #11957

* github.com:scylladb/scylladb:
  test: crc: add unit tests for constexpr clmul and barrett fold
  utils: crc combine table: generate at compile time
  utils: barrett: inline functions in header
  utils: crc combine table: generate tables at compile time
  utils: crc combine table: extract table generation into a constexpr function
  utils: crc combine table: extract "pow table" code into constexpr function
  utils: crc combine table: store tables std::arrray rather than C array
  utils: barrett: make the barrett reduction constexpr friendly
  utils: clmul: add 64-bit constexpr clmul
  utils: barrett: extract barrett reduction constants
  utils: barrett: reorder functions
  utils: make clmul() constexpr
2022-11-15 14:21:48 +01:00
Petr Gusev
ae3e0e3627 raft: rafactor: remove duplicate code on retries delays
Introduce a templated function do_on_leader_with_retries,
use it in add_entries/modify_config/read_barrier. The
function implements the basic logic of retries with aborts
and leader changes handling, adds a delay between
iterations to protect against tight loops.
2022-11-15 13:18:53 +04:00
Petr Gusev
15cc1667d0 raft: use wait_for_next_tick in read_barrier
Replaced the yield on transport_error
with wait_for_next_tick. Added delays for retries, similar
to add_entry/modify_config: we postpone the next
call attempt if we haven't received new information
about the current leader.
2022-11-15 12:31:49 +04:00
Petr Gusev
5e15c3c9bd raft: wait for the next tick before retrying
When modify_config or add_entry is forwarded
to the leader, it may reach the node at
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in case of modify_config, other
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could led to a tight loop.

The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.

We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We neglect an excessive delay if the newly elected leader
is the same as the previous one, this supposed to be a rare.

Fixes: #11564
2022-11-15 11:49:26 +04:00
Pavel Emelyanov
8dcd9d98d6 test: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-14 20:11:01 +03:00
Pavel Emelyanov
c9128e9791 test: Wse with_sstable_directory() helper
It's already used everywhere, but one test case wires up the
sstable_directory by hand. Fix it too, but keep in mind, that the caller
fn stops the directory early.

(indentation is deliberately left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-14 20:11:01 +03:00
Michał Radwański
32c60b44c5 test/boost/serialization_test.cc: add test for reverse vector
deserializer

This test is just a copy-pasted version of forward serializer test.
2022-11-14 16:06:24 +01:00
Michał Radwański
dce67f42f8 serializer_impl.hh: add reverse vector serializer
Currently when we want to deserialize mutation in reverse, we unfreeze
it and consume from the end. This new reverse vector deserializer
goes through input stream remembering substreams that contain a given
output range member, and while traversing from the back, deserialize
each substream.
2022-11-14 16:06:24 +01:00
Anna Stuchlik
e36bd208cc doc: remove the enterprise-only-note.rst file, which was replaced by the ScyllaDB Enterprise label and is not used anymore 2022-11-14 15:20:51 +01:00
Anna Stuchlik
36324fe748 doc: add the ScyllaDB Enterprise label to the descriptions of Enterprise-only features 2022-11-14 15:16:51 +01:00
Takuya ASADA
da6c472db9 install.sh: Skip systemd existance check when --without-systemd
When --without-systemd specified, install.sh should skip systemd
existance check.

Fixes #11898

Closes #11934
2022-11-14 14:07:46 +02:00
Benny Halevy
ff5527deb1 topology: copy _sort_by_proximity in copy constructor
Fixes #11962

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11965
2022-11-14 13:59:56 +03:00
Pavel Emelyanov
bd48fdaad5 Merge 'handle_state_normal: do not update topology of removed endpoint' from Benny Halevy
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.

Then, later on, on_change sees that:
```
    if (get_token_metadata().is_member(endpoint)) {
        co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Closes #11930

* github.com:scylladb/scylladb:
  storage_service, system_keyspace: add debugging around system.peers update
  storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
2022-11-14 13:58:28 +03:00
Botond Dénes
8e38551d93 Merge 'Allow each compaction group to have its own compaction backlog tracker' from Raphael "Raph" Carvalho
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction_group will be allowed to
create its own tracker and manage it by itself.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.

Closes #11762

* github.com:scylladb/scylladb:
  replica: Allow one compaction_backlog_tracker for each compaction_group
  compaction: Make compaction_state available for compaction tasks being stopped
  compaction: Implement move assignment for compaction_backlog_tracker
  compaction: Fix compaction_backlog_tracker move ctor
  compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
  compaction: kill undefined get_unimplemented_backlog_tracker()
  replica: Refactor table::set_compaction_strategy for multiple groups
  Fix exception safety when transferring ongoing charges to new backlog tracker
  replica: move_sstables_from_staging: Use tracker from group owning the SSTable
  replica: Move table::backlog_tracker_adjust_charges() to compaction_group
  replica: table::discard_sstables: Use compaction_group's backlog tracker
  replica: Disable backlog tracker in compaction_group::stop()
  replica: database_sstable_write_monitor: use compaction_group's backlog tracker
  replica: Move table::do_add_sstable() to compaction_group
  test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
  compaction/table_state: Introduce get_backlog_tracker()
2022-11-14 07:05:28 +02:00
Avi Kivity
b8cb34b928 test: crc: add unit tests for constexpr clmul and barrett fold
Check that the constexpr variants indeed match the runtime variants.

I verified manually that exactly one computation in each test is
executed at run time (and is compared against a constant).
2022-11-13 16:22:29 +02:00
Avi Kivity
70217b5109 utils: crc combine table: generate at compile time
By now the crc combine tables are generated at compile time,
but still in a separate code generation step. We now eliminate
the code generation step and instead link the global variables
directly into the main executable. The global variables have
been conveniently named exactly as the code generation step
names them, so we don't need to touch any users.
2022-11-12 17:26:45 +02:00
Avi Kivity
164e991181 utils: barrett: inline functions in header
Avoid duplicate definitions if the same header is used from more than
one place, at it will soon be.
2022-11-12 17:26:08 +02:00
Avi Kivity
a4f06773da utils: crc combine table: generate tables at compile time
Move the tables into global constinit variables that are
generated at compile time. Note the code that creates
the generated crc32_combine_table.cc is still called; it
transorms compile-time generated tables into a C++ source
that contains the same values, as literals.

If we generate a diff between gen/utils/gz/crc_combine_table.cc
before this series and after this patch, we see the only change
in the file is the type of the variable (which changed to
std::array), proving our constexpr code is correct.
2022-11-12 17:16:59 +02:00
Avi Kivity
a229fdc41e utils: crc combine table: extract table generation into a constexpr function
Move the code to a constexpr function, so we can later generate the tables at
compile time. Note that although the function is constexpr, it is still
evaluated at runtime, since the calling function (main()) isn't constexpr
itself.
2022-11-12 17:13:52 +02:00
Avi Kivity
d42bec59bb utils: crc combine table: extract "pow table" code into constexpr function
A "pow table" is used to generate the Barrett fold tables. Extract its
code into a constexpr function so we can later generate the fold tables
at compile time.
2022-11-12 17:11:44 +02:00
Avi Kivity
6e34014b64 utils: crc combine table: store tables std::arrray rather than C array
C arrays cannot be returned from functions and therefore aren't suitable
for constexpr processing. std::array<> is a regular value and so is
constexpr friendly.
2022-11-12 17:09:02 +02:00
Avi Kivity
1e9252f79a utils: barrett: make the barrett reduction constexpr friendly
Dispatch to intrinsics or constexpr based on evaluation context.
2022-11-12 17:04:44 +02:00
Avi Kivity
0bd90b5465 utils: clmul: add 64-bit constexpr clmul
This is used when generating the Barrett reduction tables, and also when
applying the Barrett reduction at runtime, so we need it to be constexpr
friendly.
2022-11-12 17:04:05 +02:00
Avi Kivity
c376c539b8 utils: barrett: extract barrett reduction constants
The constants are repeated across x86_64 and aarch64, so extract
them into a common definition.
2022-11-12 17:00:17 +02:00
Avi Kivity
2fdf81af7b utils: barrett: reorder functions
Reorder functions in dependency order rather than forward
declaring them. This makes them more constexpr-friendly.
2022-11-12 16:52:41 +02:00
Avi Kivity
8aa59a897e utils: make clmul() constexpr
clmul() is a pure function and so should already be constexpr,
but it uses intrinsics that aren't defined as constexpr and
so the compiler can't really compute it at compile time.

Fix by defining a constexpr variant and dispatching based
on whether we're being constant-evaluated or not.

The implementation is simple, but in any case proof that it
is correct will be provided later on.
2022-11-12 16:49:43 +02:00
Raphael S. Carvalho
b88acffd66 replica: Allow one compaction_backlog_tracker for each compaction_group
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.

On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Raphael S. Carvalho
d862dd815c compaction: Make compaction_state available for compaction tasks being stopped
compaction_backlog_tracker will be managed by compaction_manager, in the
per table state. As compaction tasks can access the tracker throughout
its lifetime, remove() can only deregister the state once we're done
stopping all tasks which map to that state.
remove() extracted the state upfront, then performed the stop, to
prevent new tasks from being registered and left behind. But we can
avoid the leak of new tasks by only closing the gate, which waits
for all tasks (which are stopped a step earlier) and once closed,
prevents new tasks from being registered.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Raphael S. Carvalho
0a152a2670 compaction: Implement move assignment for compaction_backlog_tracker
That's needed for std::optional to work on its behalf.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:49 -03:00
Raphael S. Carvalho
fe305cefd0 compaction: Fix compaction_backlog_tracker move ctor
Luckily it's not used anywhere. Default move ctor was picked but
it won't clear _manager of old object, meaning that its destructor
will incorrectly deregister the tracker from
compaction_backlog_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
8e1e30842d compaction: Use table_state's backlog tracker in compaction_read_monitor_generator
A step closer towards a separate backlog tracker for each compaction group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
fedafd76eb compaction: kill undefined get_unimplemented_backlog_tracker()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
90991bda69 replica: Refactor table::set_compaction_strategy for multiple groups
Refactoring the function for it to accomodate multiple compaction
groups.

To still provide strong exception guarantees, preparation and
execution of changes will be separated.

Once multiple groups are supported, each group will be prepared
first, and the noexcept execution will be done as a last step.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
244efddb22 Fix exception safety when transferring ongoing charges to new backlog tracker
When setting a new strategy, the charges of old tracker is transferred
to the new one.

The problem is that we're not reverting changes if exception is
triggered before the new strategy is successfully set.

To fix this exception safety issue, let's copy the charges instead
of moving them. If exception is triggered, the old tracker is still
the one used and remain intact.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
d1e2dbc592 replica: move_sstables_from_staging: Use tracker from group owning the SSTable
When moving SSTables from staging directory, we'll conditionally add
them to backlog tracker. As each group has its own tracker, a given
sstable will be added to the tracker of the group that owns it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Raphael S. Carvalho
9031dc3199 replica: Move table::backlog_tracker_adjust_charges() to compaction_group
Procedures that call this function happen to be in compaction_group,
so let's move it to group. Simplifies the change where the procedure
retrieves tracker from the group itself.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
116459b69e replica: table::discard_sstables: Use compaction_group's backlog tracker
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
b2d8545b15 replica: Disable backlog tracker in compaction_group::stop()
As we're moving backlog tracker to compaction group, we need to
stop the tracker there too. We're moving it a step earlier in
table::stop(), before sstables are cleared, but that's okay
because it's still done after the group was deregistered
from compaction manager, meaning no compactions are running.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
91b0d772e2 replica: database_sstable_write_monitor: use compaction_group's backlog tracker
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
f37a05b559 replica: Move table::do_add_sstable() to compaction_group
All callers of do_add_sstable() live in compaction_group, so it
should be moved into compaction_group too. It also makes easier
for the function to retrieve the backlog tracker from the group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
835927a2ad test/sstable_compaction_test: Switch to table_state::get_backlog_tracker()
Important for decoupling backlog tracker from table's compaction
strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Raphael S. Carvalho
1ec0ef18a5 compaction/table_state: Introduce get_backlog_tracker()
This interface will be helpful for allowing replica::table, unit
tests and sstables::compaction to access the compaction group's tracker
which will be managed by the compaction manager, once we complete
the decoupling work.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:36 -03:00
Nadav Har'El
ff87624fb4 test/cql-pytest: add another regression test for reversed-type bug
In commit 544ef2caf3 we fixed a bug where
a reveresed clustering-key order caused problems using a secondary index
because of incorrect type comparison. That commit also included a
regression test for this fix.

However, that fix was incomplete, and improved later in commit
c8653d1321. That later fix was labeled
"better safe than sorry", and did not include a test demonstrating
any actual bug, so unsurprisingly we never backported that second
fix to any older branches.

Recently we discovered that missing the second patch does cause real
problems, and this patch includes a test which fails when the first
patch is in, but the second patch isn't (and passes when both patches
are in, and also passes on Cassandra).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11943
2022-11-11 11:01:22 +02:00
Botond Dénes
302917f63d mutation_compactor: add validator
The mutation compactor is used on most read-paths we have, so adding a
validator to it gives us a good coverage, in particular it gives us full
coverage of queries and compaction.
The validator validates mutation token (and mutation fragment kind)
monotonicity as that is quite cheap, while it is enough to catch the
most common problems. As we already have a validator on the compaction
path (in the sstable writer), the validator is disabled when the
mutation compactor is instantiated for compaction.
We should probably make this configurable at some point. The addition
of this validator should prevent the worst of the fragment reordering
bugs to affect reads.
2022-11-11 10:26:05 +02:00
Botond Dénes
5c245b4a5e mutation_fragment_stream_validator: add a 'none' validation level
Which, as its name suggests, makes the validating filter not validate
anything at all. This validation level can be used effectively to make
it so as if the validator was not there at all.
2022-11-11 09:58:44 +02:00
Botond Dénes
a4b58f5261 test/boost/mutation_query_test: test_partition_limit: sort input data
The test's input data is currently out-of-order, violating a fundamental
invariant of data always being sorted. This doesn't cause any problems
right now, but soon it will. Sort it to avoid it.
2022-11-11 09:58:44 +02:00
Botond Dénes
2c551bb7ce querier: consume_page(): use partition_start as the sentinel value
Said method calls `compact_mutation_state::start_new_page()` which
requires the kind of the next fragment in the reader. When there is no
fragment (reader is at EOS), we use partition-end. This was a poor
choice: if the reader is at EOS, partition-kind was the last fragment
kind, if the stream were to continue the next fragment would be a
partition-start.
2022-11-11 09:58:18 +02:00
Botond Dénes
0bcfc9d522 treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
We just added a convenience static factory method for partition end,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Botond Dénes
6a002953e9 position_in_partition: add for_partition_{start,end}() 2022-11-11 09:58:18 +02:00
Kamil Braun
4a2ec888d5 Merge 'test.py: use internal id to manage servers' from Alecco
Instead of using assigned IP addresses, use a local integer ID for
managing servers. IP address can be reused by a different server.

While there, get host ID (UUID). This can also be reused with `node
replace` so it's not good enough for tracking.

Closes #11747

* github.com:scylladb/scylladb:
  test.py: use internal id to manage servers
  test.py: rename hostname to ip_addr
  test.py: get host id
  test.py: use REST api client in ScyllaCluster
  test.py: remove unnecessary reference to web app
  test.py: requests without aiohttp ClientSession
2022-11-10 17:12:16 +01:00
Kamil Braun
1cc68b262e docs: describe the Raft upgrade and recovery procedures
In the 5.1 -> 5.2 upgrade doc, include additional steps for enabling
Raft using the `consistent_cluster_management` flag. Note that we don't
have this flag yet but it's planned to replace the experimental flag in
5.2.

In the "Raft in ScyllaDB" document, add sections about:
- enabling Raft in existing clusters in Scylla 5.2,
- verifying that the internal Raft upgrade procedure finishes
  successfully,
- recovering from a stuck Raft upgrade procedure or from a majority loss
  situation.

Fix some problems in the documentation, e.g. it is not possible to
enable Raft in an existing cluster in 5.0, but the documentation claimed
that it is.

Follow-up items:
- if we decide for a different name for `consistent_cluster_management`,
  use that name in the docs instead
- update the warnings in Scylla to link to the Raft doc
- mention Enterprise versions once we know the numbers
- update the appropriate upgrade docs for Enterprise versions
  once they exist
2022-11-10 17:08:57 +01:00
Kamil Braun
3dab07ec11 docs: add upgrade guide 5.1 -> 5.2
It's a copy-paste from the 5.0 -> 5.1 guide with substitutions:
s/5.1/5.2,
s/5.0/5.1

The metric update guide is not written, I left a TODO.

Also I didn't include the guide in
docs/upgrade/upgrade-opensource/index.rst, since 5.2 is not released
yet.

The guide can be accessed by manually following the link:
/upgrade/upgrade-opensource/upgrade-guide-from-5.1-to-5.2/
2022-11-10 16:49:14 +01:00
Alejo Sanchez
700054abee test.py: use internal id to manage servers
Instead of using assigned IP addresses, use an internal server id.

Define types to distinguish local server id, host ID (UUID), and IP
address.

This is needed to test servers changing IP address and for node replace
(host UUID).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
1e38f5478c test.py: rename hostname to ip_addr
The code explicitly manages an IP as string, make it explicit in the
variable name.

Define its type and test for set in the instance instead of using an
empty string as placeholder.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
f478eb52a3 test.py: get host id
When initializing a ScyllaServer, try to get the host id instead of only
checking the REST API is up.

Use the existing aiohttp session from ScyllaCluster.

In case of HTTP error check the status was not an internal error (500+).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
78663dda72 test.py: use REST api client in ScyllaCluster
Move the REST api client to ScyllaCluster. This will allow the cluster
to query its own servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
75ea345611 test.py: remove unnecessary reference to web app
The aiohttp.web.Application only needs to be passed, so don't store a
reference in ScyllaCluster object.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Alejo Sanchez
a5316b0c6b test.py: requests without aiohttp ClientSession
Simplify REST helper by doing requests without a session.

Reusing an aiohttp.ClientSession causes knock-on effects on
`rest_api/test_task_manager` due to handling exceptions outside of an
async with block.

Requests for cluster management and Scylla REST API don't need session,
anyway.

Raise HTTPError with status code, text reason, params, and json.

In ScyllaCluster.install_and_start() instead of adding one more custom
exception, just catch all exceptions as they will be re-raised later.

While there avoid code duplication and improve sanity, type checking,
and lint score.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-11-10 09:14:37 +01:00
Botond Dénes
21bc37603a Merge 'utils: config_src: add set_value_on_all_shards functions' from Benny Halevy
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.

However, the latter broadcasts all value to all shards
so it's terribly inefficient.

Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.

Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.

Refs #7316

Closes #11893

* github.com:scylladb/scylladb:
  tests: check ttl on different shards
  utils: config_src: add set_value_on_all_shards functions
  utils: config_file: add config_source::API
2022-11-10 07:16:39 +02:00
Botond Dénes
3aff59f189 Merge 'staging sstables: filter tokens for view update generation' from Benny Halevy
This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator.

The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges.

Refs #9559

Closes #11932

* github.com:scylladb/scylladb:
  db: view_update_generator: always clean up staging sstables
  compaction: extract incremental_owned_ranges_checker out to dht
2022-11-10 07:00:51 +02:00
Avi Kivity
9b6ab5db4a Update seastar submodule
* seastar e0dabb361f...153223a188 (8):
  > build: compile dpdk with -fpie (position independent executable)
  > Merge 'io_request: remove ctor overloads of io_request and s/io_request/const io_request/' from Kefu Chai
  > iostream: remove unused function
  > smp: destroy_smp_service_group: verify smp_service_group id
  > core/circular_buffer: refactor loop in circular_buffer::erase()
  > Merge 'Outline reactor::add_task() and sanitize reactor::shuffle() methods' from Pavel Emelyanov
  > Add NOLINT for cert-err58-cpp
  > tests: Fix false-positive use-after-free detection

Closes #11940
2022-11-09 23:36:50 +02:00
Aleksandra Martyniuk
b0ed4d1f0f tests: check ttl on different shards
Test checking if ttl is properly set is extended to check
whether the ttl value is changed on non-zero shard.
2022-11-09 16:58:46 +02:00
Botond Dénes
725e5b119d Revert "replica: Pick new generation for SSTables being moved from staging dir"
This reverts commit ba6186a47f.

Said commit violates the widely held assumption that sstables
generations can be used as sstable identity. One known problem caused
this is potential OOO partition emitted when reading from sstables
(#11843). We now also have a better fix for #11789 (the bug this commit
was meant to fix): 4aa0b16852. So we can
revert without regressions.

Fixes: #11843

Closes #11886
2022-11-09 16:35:31 +02:00
Eliran Sinvani
ab7429b77d cql: Fix crash upon use of the word empty for service level name
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash, this wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentialy caused a null dereference.

Tests: 1. The fixed issue scenario from github.
       2. Unit tests in release mode.

Fixes #11774

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>

Closes #11777
2022-11-09 15:58:57 +02:00
Anna Stuchlik
d2e54f7097 Merge branch 'master' into anna-requirements-arm-aws 2022-11-09 14:39:00 +01:00
Anna Stuchlik
8375304d9b Update docs/getting-started/system-requirements.rst
Co-authored-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2022-11-09 14:37:34 +01:00
Benny Halevy
38d8777d42 storage_service, system_keyspace: add debugging around system.peers update
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:47 +02:00
Benny Halevy
5401b6055c storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.

Then, later on, on_change sees that:
```
        if (get_token_metadata().is_member(endpoint)) {
            co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:22 +02:00
Benny Halevy
1a183047c0 utils: config_src: add set_value_on_all_shards functions
Currently when we set a single value we need
to call broadcast_to_all_shards to let observers on all
shards get notified of the new value.

However, the latter broadcasts all value to all shards
so it's terribly inefficient.

Instead, add async set_value_on_all_shards functions
to broadcast a value to all shards.

Use those in system_keyspace for db_config_table virtual table
and in task_manager_test to update the task_manager ttl.

Refs #7316

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 11:55:14 +02:00
Benny Halevy
e83f42ec70 utils: config_file: add config_source::API
For task_manager test api.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 11:53:20 +02:00
Botond Dénes
94db2123b9 Update tools/java submodule
* tools/java 583261fc0e...caf754f243 (1):
  > build: remove JavaScript snippets in ant build file
2022-11-09 07:59:04 +02:00
Benny Halevy
10f8f13b90 db: view_update_generator: always clean up staging sstables
Since they are currently not cleaned up by cleanup compaction
filter their tokens, processing only tokens owned by the
current node (based on the keyspace replication strategy).

Refs #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 07:38:22 +02:00
Benny Halevy
fd3e66b0cc compaction: extract incremental_owned_ranges_checker out to dht
It is currently used by cleanup_compaction partition filter.
Factor it out so it can be used to filter staging sstables in
the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 07:32:56 +02:00
Gleb Natapov' via ScyllaDB development
2100a8f4ca service: raft: demote configuration change error to warning since it is retried anyway
Message-Id: <Y2ohbFtljmd5MNw0@scylladb.com>
2022-11-09 00:09:39 +01:00
Avi Kivity
04ecf4ee18 Update tools/java submodule (cassandra-stress fails with node down)
* tools/java 87672be28e...583261fc0e (1):
  > cassandra-stress: pass all hosts stright to the driver
2022-11-08 14:58:14 +02:00
Botond Dénes
7f69cccbdf scylla-gdb.py: $downcast_vptr(): add multiple inheritance support
When a class inherits from multiple virtual base classes, pointers to
instances of this class via one of its base classes, might point to
somewhere into the object, not at its beginning. Therefore, the simple
method employed currently by $downcast_vptr() of casting the provided
pointer to the type extracted from the vtable name fails. Instead when
this situation is detected (detectable by observing that the symbol name
of the partial vtable is not to an offset of +16, but larger),
$downcast_vptr() will iterate over the base classes, adjusting the
pointer with their offsets, hoping to find the true start of the object.
In the one instance I tested this with, this method worked well.
At the very least, the method will now yield a null pointer when it
fails, instead of a badly casted object with corrupt content (which the
developer might or might not attribute to the bad cast).

Closes #11892
2022-11-08 14:51:26 +02:00
Michał Chojnowski
3e0c7a6e9f test: sstable_datafile_test: eliminate a use of std::regex to prevent stack overflow
This usage of std::regex overflows the seastar::thread stack size (128 KiB),
causing memory corruption. Fix that.

Closes #11911
2022-11-08 14:41:34 +02:00
Botond Dénes
2037d7f9cd Merge 'doc: add the "ScyllaDB Enterprise" label to highlight the Enterprise-only features' from Anna Stuchlik
This PR adds the "ScyllaDB Enterprise" label to highlight the Enterprise-only features on the following pages:
- Encryption at Rest - the label indicates that the entire page is about an Enterprise-only feature.
- Compaction - the labels indicate the sections that are Enterprise-only.

There are more occurrences across the docs that require a similar update. I'll update them in another PR if this PR is approved.

Closes #11918

* github.com:scylladb/scylladb:
  doc: fix the links to resolve the warnings
  doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box
  doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box
2022-11-08 09:53:48 +02:00
Raphael S. Carvalho
a57724e711 Make off-strategy compaction wait for view building completion
Prior to off-strategy compaction, streaming / repair would place
staging files into main sstable set, and wait for view building
completion before they could be selected for regular compaction.

The reason for that is that view building relies on table providing
a mutation source without data in staging files. Had regular compaction
mixed staging data with non-staging one, table would have a hard time
providing the required mutation source.

After off-strategy compaction, staging files can be compacted
in parallel to view building. If off-strategy completes first, it
will place the output into the main sstable set. So a parallel view
building (on sstables used for off-strategy) may potentially get a
mutation source containing staging data from the off-strategy output.
That will mislead view builder as it won't be able to detect
changes to data in main directory.

To fix it, we'll do what we did before. Filter out staging files
from compaction, and trigger the operation only after we're done
with view building. We're piggybacking on off-strategy timer for
still allowing the off-strategy to only run at the end of the
node operation, to reduce the amount of compaction rounds on
the data introduced by repair / streaming.

Fixes #11882.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11919
2022-11-08 08:53:58 +02:00
Botond Dénes
243fcb96f0 Update tools/python3 submodule
* tools/python3 bf6e892...773070e (1):
  > create-relocatable-package: harden against missing files
2022-11-08 08:43:30 +02:00
Avi Kivity
46690bcb32 build: harden create-relocatable-package.py against changes in libthread-db.so name
create-relocatable-package.py collects shared libraries used by
executables for packaging. It also adds libthread-db.so to make
debugging possible. However, the name it uses has changed in glibc,
so packaging fails in Fedora 37.

Switch to the version-agnostic names, libthread-db.so. This happens
to be a symlink, so resolve it.

Closes #11917
2022-11-08 08:41:22 +02:00
Takuya ASADA
acc408c976 scylla_setup: fix incorrect type definition on --online-discard option
--online-discard option defined as string parameter since it doesn't
specify "action=", but has default value in boolean (default=True).
It breaks "provisioning in a similar environment" since the code
supposed boolean value should be "action='store_true'" but it's not.

We should change the type of the option to int, and also specify
"choices=[0, 1]" just like --io-setup does.

Fixes #11700

Closes #11831
2022-11-08 08:40:44 +02:00
Avi Kivity
3d345609d8 config: disable "mc" format sstables for new data
"md" format was introduced in 4.3, in 3530e80ce1, two years ago.
Disable the option to create new sstables with the "mc" format.

Closes #11265
2022-11-08 08:36:27 +02:00
Anna Stuchlik
0eaafced9d doc: fix the links to resolve the warnings 2022-11-07 19:15:21 +01:00
Anna Stuchlik
b57e0cfb7c doc: add the Enterprise label on the Compaction page (to a subheading and on a list of strategies) to replace the info box 2022-11-07 18:54:35 +01:00
Anna Stuchlik
9f3fcb3fa0 doc: add the Enterprise label to the Encryption at Rest page (the entire page) to replace the info box 2022-11-07 18:48:37 +01:00
Tomasz Grabiec
a9063f9582 Merge 'service/raft: failure detector: ping raft::server_ids, not gms::inet_addresses' from Kamil Braun
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).

Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.

To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).

This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.

To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
  an IP address, because now the stored `_alive_set` already contains
  `raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
  `gms::inet_address`. To send the ping message, we need to translate
  this to IP address; we do it by the `raft_address_map` pointer
  introduced in an earlier commit.

Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.

Closes #11759

* github.com:scylladb/scylladb:
  direct_failure_detector: get rid of complex `endpoint_id` translations
  service/raft: ping `raft::server_id`s, not `gms::inet_address`es
  service/raft: store `raft_address_map` reference in `direct_fd_pinger`
  gms: gossiper: move `direct_fd_pinger` out to a separate service
  gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
2022-11-07 16:42:35 +01:00
Botond Dénes
2b572d94f5 Merge 'doc: improve the documentation landing page ' from Anna Stuchlik
This PR introduces the following changes to the documentation landing page:

- The " New to ScyllaDB? Start here!" box is added.
- The "Connect your application to Scylla" box is removed.
- Some wording has been improved.
- "Scylla" has been replaced with "ScyllaDB".

Closes #11896

* github.com:scylladb/scylladb:
  Update docs/index.rst
  doc: replace Scylla with ScyllaDB on the landing page
  doc: improve the wording on the landing page
  doc: add the link to the ScyllaDB Basics page to the documentation landing page
2022-11-07 16:18:59 +02:00
Avi Kivity
91f2cd5ac4 test: lib: exception_predicate: use boost::regex instead of std::regex
std::regex was observed to overflow stack on aarch64 in debug mode. Use
boost::regex until the libstdc++ bug[1] is fixed.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61582

Closes #11888
2022-11-07 14:03:25 +02:00
Kamil Braun
0c7ff0d2cb docs: a single 5.0 -> 5.1 upgrade guide
There were 4 different pages for upgrading Scylla 5.0 to 5.1 (and the
same is true for other version pairs, but I digress) for different
environments:
- "ScyllaDB Image for EC2, GCP, and Azure"
- Ubuntu
- Debian
- RHEL/CentOS

THe Ubuntu and Debian pages used a common template:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```
with different variable substitutions.

The "Image" page used a similar template, with some extra content in the
middle:
```
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p1.rst
.. include:: /upgrade/_common/upgrade-image-opensource.rst
.. include:: /upgrade/_common/upgrade-guide-v5-ubuntu-and-debian-p2.rst
```

The RHEL/CentOS page used a different template:
```
.. include:: /upgrade/_common/upgrade-guide-v4-rpm.rst
```

This was an unmaintainable mess. Most of the content was "the same" for
each of these options. The only content that must actually be different
is the part with package installation instructions (e.g. calls to `yum`
vs `apt-get`). The rest of the content was logically the same - the
differences were mistakes, typos, and updates/fixes to the text that
were made in some of these docs but not others.

In this commit I prepare a single page that covers the upgrade and
rollback procedures for each of these options. The section dependent on
the system was implemented using Sphinx Tabs.

I also fixed and changed some parts:

- In the "Gracefully stop the node" section:
Ubuntu/Debian/Images pages had:

```rst
.. code:: sh

   sudo service scylla-server stop
```

RHEL/CentOS pages had:
```rst
.. code:: sh

.. include:: /rst_include/scylla-commands-stop-index.rst
```

the stop-index file contained this:
```rst
.. tabs::

   .. group-tab:: Supported OS

      .. code-block:: shell

         sudo systemctl stop scylla-server

   .. group-tab:: Docker

      .. code-block:: shell

         docker exec -it some-scylla supervisorctl stop scylla

      (without stopping *some-scylla* container)
```

So the RHEL/CentOS version had two tabs: one for Scylla installed
directly on the system, one for Scylla running in Docker - which is
interesting, because nothing anywhere else in the upgrade documents
mentions Docker.  Furthermore, the RHEL/CentOS version used `systemctl`
while the ubuntu/debian/images version used `service` to stop/start
scylla-server.  Both work on modern systems.

The Docker option is completely out of place - the rest of the upgrade
procedure does not mention Docker. So I decided it doesn't make sense to
include it. Docker documentation could be added later if we actually
decide to write upgrade documentation when using Docker...  Between
`systemctl` and `service` I went with `service` as it's a bit
higher-level.

- Similar change for "Start the node" section, and corresponding
  stop/start sections in the Rollback procedure.

- To reuse text for Ubuntu and Debian, when referencing "ScyllaDB deb
  repo" in the Debian/Ubuntu tabs, I provide two separate links: to
  Debian and Ubuntu repos.

- the link to rollback procedure in the RPM guide (in 'Download and
  install the new release' section) pointed to rollback procedure from
  3.0 to 3.1 guide... Fixed to point to the current page's rollback
  procedure.

- in the rollback procedure steps summary, the RPM version missed the
  "Restore system tables" step.

- in the rollback procedure, the repository links were pointing to the
  new versions, while they should point to the old versions.

There are some other pre-existing problems I noticed that need fixing:

- EC2/GCP/Azure option has no corresponding coverage in the rollback
  section (Download and install the old release) as it has in the
  upgrade section. There is no guide for rolling back 3rd party and OS
  packages, only Scylla. I left a TODO in a comment.
- the repository links assume certain Debian and Ubuntu versions (Debian
  10 and Ubuntu 20), but there are more available options (e.g. Ubuntu
  22). Not sure how to deal with this problem. Maybe a separate section
  with links? Or just a generic link without choice of platform/version?

Closes #11891
2022-11-07 14:02:08 +02:00
Avi Kivity
9fa1783892 Merge 'cleanup compaction: flush memtable' from Benny Halevy
Flush the memtable before cleaning up the table so not to leave any disowned tokens in the memtable
as they might be resurrected if left in the memtable.

Fixes #1239

Closes #11902

* github.com:scylladb/scylladb:
  table: perform_cleanup_compaction: flush memtable
  table: add perform_cleanup_compaction
  api: storage_service: add logging for compaction operations et al
2022-11-07 13:18:12 +02:00
Anna Stuchlik
c8455abb71 Update docs/index.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-11-07 10:25:24 +01:00
AdamStawarz
6bc455ebea Update tombstones-flush.rst
change syntax:

nodetool compact <keyspace>.<mytable>;
to
nodetool compact <keyspace> <mytable>;

Closes #11904
2022-11-07 11:19:26 +02:00
Avi Kivity
224a2877b9 build: disable -Og in debug mode to avoid coroutine asan breakage
Coroutines and asan don't mix well on aarch64. This was seen in
22f13e7ca3 (" Revert "Merge 'cql3: select_statement: coroutinize
indexed_table_select_statement::do_execute_base_query()' from Avi
Kivity"") where a routine coroutinization was reverted due to failures
on aarch64 debug mode.

In clang 15 this is even worse, the existing code starts failing.
However, if we disable optimization (-O0 rather than -Og), things
begin to work again. In fact we can reinstate the patch reverted
above even with clang 12.

Fix (or rather workaround) the problem by avoiding -Og on aarch64
debug mode. There's the lingering fear that release mode is
miscompiled too, but all the tests pass on clang 15 in release mode
so it appears related to asan.

Closes #11894
2022-11-07 10:55:13 +02:00
Benny Halevy
eb3a94e2bc table: perform_cleanup_compaction: flush memtable
We don't explicitly cleanup the memtable, while
it might hold tokens disowned by the current node.

Flush the memtable before performing cleanup compaction
to make sure all tokens in the memtable are cleaned up.

Note that non-owned ranges are invalidate in the cache
in compaction_group::update_main_sstable_list_on_compaction_completion
using desc.ranges_for_cache_invalidation.

Fixes #1239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:40 +02:00
Benny Halevy
fc278be6c4 table: add perform_cleanup_compaction
Move the integration with compaction_manager
from the api layer to the tabel class so
it can also make sure the memtable is cleaned up in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:33 +02:00
Benny Halevy
85523c45c0 api: storage_service: add logging for compaction operations et al
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-06 19:41:31 +02:00
Petr Gusev
44f48bea0f raft: test_remove_node_with_concurrent_ddl
The test runs remove_node command with background ddl workload.
It was written in an attempt to reproduce scylladb#11228 but seems to have
value on its own.

The if_exists parameter has been added to the add_table
and drop_table functions, since the driver could retry
the request sent to a removed node, but that request
might have already been completed.

Function wait_for_host_known waits until the information
about the node reaches the destination node. Since we add
new nodes at each iteration in main, this can take some time.

A number of abort-related options was added
SCYLLA_CMDLINE_OPTIONS as it simplifies
nailing down problems.

Closes #11734
2022-11-04 17:16:35 +01:00
David Garcia
26bc53771c docs: automatic previews configuration
Closes #11591
2022-11-04 15:44:22 +02:00
Kamil Braun
e086521c1a direct_failure_detector: get rid of complex endpoint_id translations
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.

Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.

In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.

In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
2022-11-04 09:38:08 +01:00
Kamil Braun
bdeef77f20 service/raft: ping raft::server_ids, not gms::inet_addresses
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).

Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.

To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).

This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.

To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
  an IP address, because now the stored `_alive_set` already contains
  `raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
  `gms::inet_address`. To send the ping message, we need to translate
  this to IP address; we do it by the `raft_address_map` pointer
  introduced in an earlier commit.

Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
2022-11-04 09:38:08 +01:00
Kamil Braun
ac70a05c7e service/raft: store raft_address_map reference in direct_fd_pinger
The pinger will use the map to translate `raft::server_id`s to
`gms::inet_address`es when pinging.
2022-11-04 09:38:08 +01:00
Kamil Braun
2c20f2ab9d gms: gossiper: move direct_fd_pinger out to a separate service
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
2022-11-04 09:38:08 +01:00
Kamil Braun
e9a4263e14 gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
`gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them
is to maintain a mapping between `gms::inet_address`es and
`direct_failure_detector::pinger::endpoint_id`s, another is to cache the
last known gossiper's generation number to use it for sending gossip
echo messages. The latter is the only gossiper-specific thing in this
class.

We want to move `direct_fd_pinger` utside `gossiper`. To do that, split the
gossiper-specific thing -- the generation number management -- to a
smaller class, `echo_pinger`.

`echo_pinger` is a top-level class (not a nested one like
`direct_fd_pinger` was) so we can forward-declare it and pass references
to it without including gms/gossiper.hh header.
2022-11-04 09:38:08 +01:00
Avi Kivity
768d77d31b Update seastar submodule
* seastar f32ed00954...e0dabb361f (12):
  > sstring: define formatter
  > file: Dont violate API layering
  > Add compile_commands.json to gitignore
  > Merge 'Add an allocation failure metric' from Travis Downs
  > Use const test objects
  > Ragel chunk parser: compilation err, unused var
  > build: do not expose Valgrind in SeastarTargets.cmake
  > defer: mark deferred_* with [[nodiscard]]
  > Log selected reactor backend during startup
  > http: mark str with [[maybe_unused]]
  > Merge 'reactor: open fd without O_NONBLOCK when using io_uring backend' from Kefu Chai
  > reactor: add accept and connect to io_uring backend

Closes #11895
2022-11-04 09:27:56 +04:00
Anna Stuchlik
fb01565a15 doc: replace Scylla with ScyllaDB on the landing page 2022-11-03 17:42:49 +01:00
Anna Stuchlik
7410ab0132 doc: improve the wording on the landing page 2022-11-03 17:38:14 +01:00
Anna Stuchlik
ab5e48261b doc: add the link to the ScyllaDB Basics page to the documentation landing page 2022-11-03 17:31:03 +01:00
Pavel Emelyanov
efbfcdb97e Merge 'Replicate raft_address_map non-expiring entries to other shards' from Kamil Braun
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.

Closes #11791

* github.com:scylladb/scylladb:
  test/raft: raft_address_map_test: add replication test
  service/raft: raft_address_map: replicate non-expiring entries to other shards
  service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
  service/raft: turn raft_address_map into a service
2022-11-03 18:34:42 +03:00
Avi Kivity
ca2010144e test: loading_cache_test: fix use-after-free in test_loading_cache_remove_leaves_no_old_entries_behind
We capture `key` by reference, but it is in a another continuation.

Capture it by value, and avoid the default capture specification.

Found by clang 15 + asan + aarch64.

Closes #11884
2022-11-03 17:23:40 +02:00
Avi Kivity
0c3967cf5e Merge 'scylla-gdb.py: improve scylla-fiber' from Botond Dénes
The main theme of this patchset is improving `scylla-fiber`, with some assorted unrelated improvement tagging along.
In lieu of explicit support for mapping up continuation chains in memory from seastar (there is one but it uses function calls), scylla fiber uses a quite crude method to do this: it scans task objects for outbound references to other task objects to find waiters tasks and scans inbound references from other tasks to find waited-on tasks. This works well for most objects, but there are some problematic ones:
* `seastar::thread_context`: the waited-on task (`seastar::(anonymous namespace)::thread_wake_task`) is allocated on the thread's stack which is not in the object itself. Scylla fiber now scans the stack bottom-up to find this task.
* `seastar::smp_message_queue::async_work_item`: the waited on task lives on another shard. Scylla fiber now digs out the remote shard from the work item and continues the search on the remote shard.
* `seastar::when_all_state`: the waited on task is a member in the same object tripping loop detection and terminating the search. Seastar fiber now uses the `_continuation` member explicitely to look for the next links.

Other minor improvements were also done, like including the shard of the task in the printout.
Example demonstrating all the new additions:
```
(gdb) scylla fiber 0x000060002d650200
Stopping because loop is detected: task 0x000061c00385fb60 was seen before.
[shard 28] #-13 (task*) 0x000061c00385fba0 0x00000000003b5b00 vtable for seastar::internal::when_all_state_component<seastar::future<void> > + 16
[shard 28] #-12 (task*) 0x000061c00385fb60 0x0000000000417010 vtable for seastar::internal::when_all_state<seastar::internal::identity_futures_tuple<seastar::future<void>, seastar::future<void> >, seastar::future<void>, seastar::future<void> > + 16
[shard 28] #-11 (task*) 0x000061c009f16420 0x0000000000419830 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_6futureISt5tupleIJNS4_IvEES6_EEE14discard_resultEvEUlDpOT_E_ZNS8_14then_impl_nrvoISC_S6_EET0_OT_EUlOS3_RSC_ONS_12future_stateIS7_EEE_S7_EE + 16
[shard 28] #-10 (task*) 0x000061c0098e9e00 0x0000000000447440 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>::run_and_dispose()::{lambda(auto:1)#1}, seastar::future<void>::then_wrapped_nrvo<void, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> >(seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-9 (task*) 0x000060000858dcd0 0x0000000000449d68 vtable for seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::stop()::{lambda(unsigned int)#1}::operator()(unsigned int)::{lambda()#1}> + 16
[shard  0] #-8 (task*) 0x0000600050c39f60 0x00000000007abe98 vtable for seastar::parallel_for_each_state + 16
[shard  0] #-7 (task*) 0x000060000a59c1c0 0x0000000000449f60 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::sharded<cql_transport::cql_server>::stop()::{lambda(seastar::future<void>)#2}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#2}>({lambda(seastar::future<void>)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-6 (task*) 0x000060000a59c400 0x0000000000449ea0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, cql_transport::controller::do_stop_server()::{lambda(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&)#1}::operator()(std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > >&) const::{lambda()#1}::operator()() const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda()#1}, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda()#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #-5 (task*) 0x0000600009d86cc0 0x0000000000449c00 vtable for seastar::internal::do_with_state<std::tuple<std::unique_ptr<seastar::sharded<cql_transport::cql_server>, std::default_delete<seastar::sharded<cql_transport::cql_server> > > >, seastar::future<void> > + 16
[shard  0] #-4 (task*) 0x00006000019ffe20 0x00000000007ab368 vtable for seastar::(anonymous namespace)::thread_wake_task + 16
[shard  0] #-3 (task*) 0x00006000085ad080 0x0000000000809e18 vtable for seastar::thread_context + 16
[shard  0] #-2 (task*) 0x0000600009c04100 0x00000000006067f8 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS6_E_clES7_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSC_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSD_DpOSG_EUlvE0_ZNS_6futureIvE14then_impl_nrvoIST_SV_EET0_SQ_EUlOS3_RST_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #-1 (task*) 0x000060000a59c080 0x0000000000606ae8 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_5asyncIZZN7service15storage_service5drainEvENKUlRS9_E_clESA_EUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSF_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSG_DpOSJ_EUlvE1_Lb0EEEZNS5_17then_wrapped_nrvoIS5_SX_EENSD_ISG_E4typeEOT0_EUlOS3_RSX_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #0  (task*) 0x000060002d650200 0x0000000000606378 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<service::storage_service::run_with_api_lock<service::storage_service::drain()::{lambda(service::storage_service&)#1}>(seastar::basic_sstring<char, unsigned int, 15u, true>, service::storage_service::drain()::{lambda(service::storage_service&)#1}&&)::{lambda(service::storage_service&)#1}::operator()(service::storage_service&)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(service::storage_service&)#1}>({lambda(service::storage_service&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(service::storage_service&)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #1  (task*) 0x000060000bc40540 0x0000000000606d48 _ZTVN7seastar12continuationINS_8internal22promise_base_with_typeIvEENS_6futureIvE12finally_bodyIZNS_3smp9submit_toIZNS_7shardedIN7service15storage_serviceEE9invoke_onIZNSB_17run_with_api_lockIZNSB_5drainEvEUlRSB_E_EEDaNS_13basic_sstringIcjLj15ELb1EEEOT_EUlSF_E_JES5_EET1_jNS_21smp_submit_to_optionsESK_DpOT0_EUlvE_EENS_8futurizeINSt9result_ofIFSJ_vEE4typeEE4typeEjSN_SK_EUlvE_Lb0EEEZNS5_17then_wrapped_nrvoIS5_S10_EENSS_ISJ_E4typeEOT0_EUlOS3_RS10_ONS_12future_stateINS1_9monostateEEEE_vEE + 16
[shard  0] #2  (task*) 0x000060000332afc0 0x00000000006cb1c8 vtable for seastar::continuation<seastar::internal::promise_base_with_type<seastar::json::json_return_type>, api::set_storage_service(api::http_context&, seastar::httpd::routes&)::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >) const::{lambda()#1}, seastar::future<void>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}, {lambda()#1}<seastar::json::json_return_type> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::json::json_return_type>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)#38}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #3  (task*) 0x000060000a1af700 0x0000000000812208 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::function_handler(std::function<seastar::future<seastar::json::json_return_type> (std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)> const&)::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}::operator()(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >) const::{lambda(seastar::json::json_return_type&&)#1}, seastar::future<seastar::json::json_return_type>::then_impl_nrvo<seastar::json::json_return_type&&, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > >(seastar::json::json_return_type&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, seastar::json::json_return_type&, seastar::future_state<seastar::json::json_return_type>&&)#1}, seastar::json::json_return_type> + 16
[shard  0] #4  (task*) 0x0000600009d86440 0x0000000000812228 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::httpd::function_handler::handle(seastar::basic_sstring<char, unsigned int, 15u, true> const&, std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future>({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, {lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #5  (task*) 0x0000600009dba0c0 0x0000000000812f48 vtable for seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::handle_exception<std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&>(std::function<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > (std::__exception_ptr::exception_ptr)>&)::{lambda(auto:1&&)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_wrapped_nrvo<seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >, {lambda(auto:1&&)#1}>({lambda(auto:1&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&, {lambda(auto:1&&)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #6  (task*) 0x0000600026783ae0 0x00000000008118b0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<bool>, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::future<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}, seastar::httpd::connection::generate_reply(std::unique_ptr<seastar::httpd::request, std::default_delete<seastar::httpd::request> >)::{lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}<bool> >({lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<bool>&&, {lambda(std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> >)#1}&, seastar::future_state<std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > >&&)#1}, std::unique_ptr<seastar::httpd::reply, std::default_delete<seastar::httpd::reply> > > + 16
[shard  0] #7  (task*) 0x000060000a4089c0 0x0000000000811790 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read_one()::{lambda()#1}::operator()()::{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(std::default_delete<std::unique_ptr>)#1}::operator()(std::default_delete<std::unique_ptr>) const::{lambda(bool)#2}, seastar::future<bool>::then_impl_nrvo<{lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}, {lambda(std::default_delete<std::unique_ptr>)#1}<void> >({lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(std::unique_ptr<seastar::httpd::request, std::default_delete<std::unique_ptr> >)#2}&, seastar::future_state<bool>&&)#1}, bool> + 16
[shard  0] #8  (task*) 0x000060000a5b16e0 0x0000000000811430 vtable for seastar::internal::do_until_state<seastar::httpd::connection::read()::{lambda()#1}, seastar::httpd::connection::read()::{lambda()#2}> + 16
[shard  0] #9  (task*) 0x000060000aec1080 0x00000000008116d0 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::httpd::connection::read()::{lambda(seastar::future<void>)#3}, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#3}>({lambda(seastar::future<void>)#3}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#3}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16
[shard  0] #10 (task*) 0x000060000b7d2900 0x0000000000811950 vtable for seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::httpd::connection::read()::{lambda()#4}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::httpd::connection::read()::{lambda()#4}>(seastar::httpd::connection::read()::{lambda()#4}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::httpd::connection::read()::{lambda()#4}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> + 16

Found no further pointers to task objects.
If you think there should be more, run `scylla fiber 0x000060002d650200 --verbose` to learn more.
Note that continuation across user-created seastar::promise<> objects are not detected by scylla-fiber.
```

Closes #11822

* github.com:scylladb/scylladb:
  scylla-gdb.py: collection_element: add support for boost::intrusive::list
  scylla-gdb.py: optional_printer: eliminate infinite loop
  scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects
  scylla-gdb.py: scylla-fiber: reject self-references when probing pointers
  scylla-gdb.py: scylla-fiber: add starting task to known tasks
  scylla-gdb.py: scylla-fiber: add support for walking over when_all
  scylla-gdb.py: add when_all_state to task type whitelist
  scylla-gdb.py: scylla-fiber: also print shard of tasks
  scylla-gdb.py: scylla-fiber: unify task printing
  scylla-gdb.py: scylla fiber: add support for walking over shards
  scylla-gdb.py: scylla fiber: add support for walking over seastar threads
  scylla-gdb.py: scylla-ptr: keep current thread context
  scylla-gdb.py: improve scylla column_families
  scylla-gdb.py: scylla_sstables.filename(): fix generation formatting
  scylla-gdb.py: improve schema_ptr
  scylla-gdb.py: scylla memory: restore compatibility with <= 5.1
2022-11-03 13:52:31 +02:00
Kamil Braun
2049962e11 Fix version numbers in upgrade page title
Closes #11878
2022-11-03 10:06:25 +02:00
Takuya ASADA
45789004a3 install-dependencies.sh: update node_exporter to 1.4.0
To fix CVE-2022-24675, we need to a binary compiled in <= golang 1.18.1.
Only released version which compiled <= golang 1.18.1 is node_exporter
1.4.0, so we need to update to it.

See scylladb/scylla-enterprise#2317

Closes #11400

[avi: regenerated frozen toolchain]

Closes #11879
2022-11-03 10:15:22 +04:00
Yaron Kaikov
20110bdab4 configure.py: remove un-used tar files creation
Starting from https://github.com/scylladb/scylla-pkg/pull/3035 we
removed all old tar.gz prefix from uploading to S3 or been used by
downstream jobs.

Hence, there is no point building those tar.gz files anymore

Closes #11865
2022-11-02 17:44:09 +02:00
Anna Stuchlik
d1f7cc99bc doc: fix the external links to the ScyllaDB University lesson about TTL
Closes #11876
2022-11-02 15:05:43 +02:00
Nadav Har'El
59fa8fe903 Merge 'doc: add the information about AArch64 support to Requirements' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/864

This PR:
- updates the introduction to add information about AArch64 and rewrite the content.
- replaces "Scylla" with "ScyllaDB".

Closes #11778

* github.com:scylladb/scylladb:
  Update docs/getting-started/system-requirements.rst
  doc: fix the link to the OS Support page
  doc: replace Scylla with ScyllaDB
  doc: update the info about supported architecture and rewrite the introduction
2022-11-02 11:18:20 +02:00
Anna Stuchlik
ea799ad8fd Update docs/getting-started/system-requirements.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-11-02 09:56:56 +01:00
guy9
097a65df9f adding top banner to the Docs website with a link to the ScyllaDB University fall LIVE event
Closes #11873
2022-11-02 10:20:40 +02:00
Nadav Har'El
b9d88a3601 cql/pytest: add reproducer for timestamp column validation issue
This patch adds a reproducing test for issue #11588, which is still open
so the test is expected to fail on Scylla ("xfail), and passes on Cassandra.

The test shows that Scylla allows an out-of-range value to be written to
timestamp column, but then it can't be read back.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11864
2022-11-01 08:11:01 +02:00
Botond Dénes
dc46bfa783 Merge 'Prepare repair for task manager integration' from Aleksandra Martyniuk
The PR prepares repair for task manager integration:
- Creates repair_module
- Keeps repair_module in repair_service
- Moves tracker methods to repair_module
- Changes UUID to task_id in repair module

Closes #11851

* github.com:scylladb/scylladb:
  repair: check shutdown with abort source in repair module
  repair: use generic module gate for repair module operations
  repair: move tracker to repair module
  repair: move next_repair_command to repair_module
  repair: generate repair id in repair module
  repair: keep shard number in repair_uniq_id
  repair: change UUID to task_id
  repair: add task_manager::module to repair_service
  repair: create repair module and task
2022-11-01 08:05:14 +02:00
Aleksandra Martyniuk
f2fe586f03 repair: check shutdown with abort source in repair module
In repair module the shutdown can be checked using abort_source.
Thus, we can get rid of shutdown flag.
2022-10-31 10:57:29 +01:00
Aleksandra Martyniuk
2d878cc9b5 repair: use generic module gate for repair module operations
Repair module uses a gate to prevent starting new tasks on shutdown.
Generic module's gate serves the same purpose, thus we can
use it also in repair specific context.
2022-10-31 10:56:36 +01:00
Aleksandra Martyniuk
4aae7e9026 repair: move tracker to repair module
Since both tracker and repair_module serve similar purpose,
it is confusing where we should seek for methods connected to them.
Thus, to make it more transparent, tracker class is deleted and all
its attributes and methods are moved to repair_module.
2022-10-31 10:55:36 +01:00
Aleksandra Martyniuk
a5c05dcb60 repair: move next_repair_command to repair_module
Number of the repair operation was counted both with
next_repair_command from tracer and sequence number
from task_manager::module.

To get rid of redundancy next_repair_command was deleted and all
methods using its value were moved to repair_module.
2022-10-31 10:54:39 +01:00
Aleksandra Martyniuk
c81260fb8b repair: generate repair id in repair module
repair_uniq_id for repair task can be generated in repair module
and accessed from the task.
2022-10-31 10:54:24 +01:00
Aleksandra Martyniuk
6432a26ccf repair: keep shard number in repair_uniq_id
Execution shard is one of the traits specific to repair tasks.
Child task should freely access shard id of its parent. Thus,
the shard number is kept in a repair_uniq_id struct.
2022-10-31 10:41:17 +01:00
guy9
276ec377c0 removed broken roadmap link
Closes #11854
2022-10-31 11:33:03 +02:00
Aleksandra Martyniuk
e2c7c1495d repair: change UUID to task_id
Change type of repair id from utils::UUID to task_id to distinguish
them from ids of other entities.
2022-10-31 10:07:08 +01:00
Aleksandra Martyniuk
dc80af33bc repair: add task_manager::module to repair_service
repair_service keeps a shared pointer to repair_module.
2022-10-31 10:04:50 +01:00
Aleksandra Martyniuk
576277384a repair: create repair module and task
Create repair_task_impl and repair_module inheriting from respectively
task manager task_impl and module to integrate repair operations with
task manager.
2022-10-31 10:04:48 +01:00
Takuya ASADA
159bc7c7ea install-dependencies.sh: use binary distributions of PIP package
We currently avoid compiling C code in "pip3 install scylla-driver", but
we actually providing portable binary distributions of the package,
so we should use it by "pip3 install --only-binary=:all: scylla-driver".
The binary distribution contains dependency libraries, so we won't have
problem loading it on relocatable python3.

Closes #11852
2022-10-31 10:38:36 +02:00
Kamil Braun
db6cc035ed test/raft: raft_address_map_test: add replication test 2022-10-31 09:17:12 +01:00
Kamil Braun
7d84007fd5 service/raft: raft_address_map: replicate non-expiring entries to other shards
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
2022-10-31 09:17:12 +01:00
Kamil Braun
acacbad465 service/raft: raft_address_map: assert when entry is missing in drop_expired_entries 2022-10-31 09:17:12 +01:00
Kamil Braun
159bb32309 service/raft: turn raft_address_map into a service 2022-10-31 09:17:10 +01:00
Botond Dénes
139fbb466e Merge 'Task manager extension' from Aleksandra Martyniuk
The PR adds changes to task manager that allow more convenient integration with modules.

Introduced changes:
- adds internal flag in task::impl that allows user to filter too specific tasks
- renames `parent_data` to more appropriate name `task_info`
- creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager
- adds more flexible version of `make_task` method

Closes #11821

* github.com:scylladb/scylladb:
  tasks: add alternative make_task method
  tasks: rename parent_data to task_info and move it
  tasks: move task_id to tasks/types.hh
  tasks: add internal flag for task_manager::task::impl
2022-10-31 09:57:10 +02:00
Botond Dénes
2c021affd1 Merge 'storage_service, repair: use per-shard abort_source' from Benny Halevy
Prevent copying shared_ptr across shards
in do_sync_data_using_repair by allocating
a shared_ptr<abort_source> per shard in
node_ops_meta_data and respectively in node_ops_info.

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11827

* github.com:scylladb/scylladb:
  repair: use sharded abort_source to abort repair_info
  repair: node_ops_info: add start and stop methods
  storage_service: node_ops_abort_thread: abort all node ops on shutdown
  storage_service: node_ops_abort_thread: co_return only after printing log message
  storage_service: node_ops_meta_data: add start and stop methods
  repair: node_ops_info: prevent accidental copy
2022-10-31 09:43:34 +02:00
Botond Dénes
63a90cfb6c scylla-gdb.py: collection_element: add support for boost::intrusive::list 2022-10-31 08:18:20 +02:00
Botond Dénes
2fa1864174 scylla-gdb.py: optional_printer: eliminate infinite loop
Currently, to_string() recursively calls itself for engaged optionals.
Eliminate it. Also, use the std_optional wrapper instead of accessing
std::optional internals directly.
2022-10-31 08:18:20 +02:00
Botond Dénes
77b2555a04 scylla-gdb.py: scylla-fiber: add note about user-instantiated promise objects
Scylla fiber uses a crude method of scanning inbound and outbound
references to/from other task objects of recognized type. This method
cannot detect user instantiated promise<> objects. Add a note about this
to the printout, so users are beware of this.
2022-10-31 08:18:20 +02:00
Botond Dénes
2276565a2e scylla-gdb.py: scylla-fiber: reject self-references when probing pointers
A self-reference is never the pointer we are looking for when looking
for other tasks referencing us. Reject such references when scanning
outright.
2022-10-31 08:18:20 +02:00
Botond Dénes
f4365dd7f5 scylla-gdb.py: scylla-fiber: add starting task to known tasks
We collect already seen tasks in a set to be able to detect perceived
task loops and stop when one is seen. Initialize this set with the
starting task, so if it forms a loop, we won't repeat it in the trace
before cutting the loop.
2022-10-31 08:18:20 +02:00
Botond Dénes
48bbf2e467 scylla-gdb.py: scylla-fiber: add support for walking over when_all 2022-10-31 08:18:20 +02:00
Botond Dénes
cb8f02e24b scylla-gdb.py: add when_all_state to task type whitelist 2022-10-31 08:18:20 +02:00
Botond Dénes
62621abc44 scylla-gdb.py: scylla-fiber: also print shard of tasks
Now that scylla-fiber can cross shards, it is important to display the
shard each task in the chain lives on.
2022-10-31 08:18:19 +02:00
Botond Dénes
c21c80f711 scylla-gdb.py: scylla-fiber: unify task printing
Currently there is two loops and a separate line printing the starting
task, all duplicating the formatting logic. Define a method for it and
use it in all 3 places instead.
2022-10-31 08:18:19 +02:00
Botond Dénes
c103280bfd scylla-gdb.py: scylla fiber: add support for walking over shards
Shard boundaries can be crossed in one direction currently: when looking
for waiters on a task, but not in the other direction (looking for
waited-on tasks). This patch fixes that.
2022-10-31 08:18:19 +02:00
Botond Dénes
437f888ba0 scylla-gdb.py: scylla fiber: add support for walking over seastar threads
Currently seastar threads end any attempt to follow waited-on-futures.
Seastar threads need special handling because it allocates the wake up
task on its stack. This patch adds this special handling.
2022-10-31 08:18:19 +02:00
Botond Dénes
fcc63965ed scylla-gdb.py: scylla-ptr: keep current thread context
scylla_ptr.analyze() switches to the thread the analyzed object lives
on, but forgets to switch back. This was very annoying as any commands
using it (which is a bunch of them) were prone to suddenly and
unexpectedly switching threads.
This patch makes sure that the original thread context is switched back
to after analyzing the pointer.
2022-10-31 08:18:19 +02:00
Botond Dénes
91516c1d68 scylla-gdb.py: improve scylla column_families
Rename to scylla tables. Less typing and more up-to-date.
By default it now only lists tables from local shard. Added flag -a
which brings back old behaviour (lists on all shards).
Added -u (only list user tables) and -k (list tables of provided
keyspace only) filtering options.
2022-10-31 08:18:19 +02:00
Botond Dénes
1d3d613b76 scylla-gdb.py: scylla_sstables.filename(): fix generation formatting
Generation was recently converted from an integer to an object. Update
the filename formatting, while keeping backward compatibility.
2022-10-31 08:18:19 +02:00
Botond Dénes
c869f54742 scylla-gdb.py: improve schema_ptr
Add __getitem__(), so members can be accessed.
Strip " from ks_name and cf_name.
Add is_system().
2022-10-31 08:18:19 +02:00
Botond Dénes
66832af233 scylla-gdb.py: scylla memory: restore compatibility with <= 5.1
Recent reworks around dirty memory manager broke backward compatibility
of the scylla memory command (and possibly others). This patch restores
it.
2022-10-31 08:18:19 +02:00
Tenghuan He
e0948ba199 Add directory change instruction
Add directory change instruction while building scylla

Closes #11717
2022-10-30 23:53:02 +02:00
Pavel Emelyanov
477e0c967a scylla-gdb: Evaluate LSA object sizes dynamically
The lsa-segment command tries to walk LSA segment objects by decoding
their descriptors and (!) object sizes as well. Some objects in LSA have
dynamic sizes, i.e. those depending on the object contents. The script
tries to drill down the object internals to get this size, but bad news
is that nowadays there are many dynamic objects that are not covered.
Once stepped upon unsupported object, scylla-gdb likely stops because
the "next" descriptor happens to be in the middle of the object and its
parsing throws.

This patch fixes this by taking advantage of the virtual size() call of
the migrate_fn_type all LSA objects are linked with (indirectly). It
gets the migrator object, the LSA object itself and calls

  ((migrate_fn_type*)<migrator_ptr>)->size((const void*)<object_ptr>)

with gdb. The evaluated value is the live dynamic size of the object.

fixes: #11792
refs: #2455

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11847
2022-10-28 14:11:30 +03:00
Botond Dénes
74c9aa3a3f Merge 'removenode: allow specifying nodes to ignore using host_id' from Benny Halevy
Currently, when specifying nodes to ignore for replace or removenode,
we support specifying them only using their ip address.

As discussed in https://github.com/scylladb/scylladb/issues/11839 for removenode,
we intentionally require the host uuid for specifying the node to remove,
so the nodes to ignore (that are also done, otherwise we need not ignore them),
should be consistent with that and be specified using their host_id.

The series extends the apis and allows either the nodes ip address or their host_id
to be specified, for backward compatibility.

We should deprecate the ip address method over time and convert the tests and management
software to use the ignored nodes' host_id:s instead.

Closes #11841

* github.com:scylladb/scylladb:
  api: doc: remove_node: improve summary
  api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
  storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
  locator: token_metadata: add parse_host_id_and_endpoint
  api: storage_service: remove_node: validate host_id
2022-10-28 13:35:04 +03:00
Benny Halevy
335a8cc362 api: doc: remove_node: improve summary
The current summary of the operation is obscure.
It refers to a token in the ring and the endpoint associated with it,
while the operation uses a host_id to identify a whole node.

Instead, clarify the summary to refer to a node in the cluster,
consistent with the description for the host_id parameter.
Also, describe the effect the call has on the data the removed node
logically owned.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:52:37 +03:00
Benny Halevy
9ef2631ec2 api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
Currently the api is inconsistent: requiring a uuid for the
host_id of the node to be removed, while the ignored nodes list
is given as comma-separated ip addresses.

Instead, support identifying the ignored_nodes either
by their host_id (uuid) or ip address.

Also, require all ignore_nodes to be of the same kind:
either UUIDs or ip addresses, as a mix of the 2 is likely
indicating a user error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:49:03 +03:00
Benny Halevy
40cd685371 storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
Allow specifying the dead node to ignore either as host_id
or ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Benny Halevy
b74807cb8a locator: token_metadata: add parse_host_id_and_endpoint
To be used for specifying nodes either by their
host_id or ip address and using the token_metadata
to resolve the mapping.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Benny Halevy
340a5a0c94 api: storage_service: remove_node: validate host_id
The node to be removed must be identified by its host_id.
Validate that at the api layer and pass the parsed host_id
down to storage_service::removenode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Takuya ASADA
464b5de99b scylla_setup: allow symlink to --disks option
Currently, --disks options does not allow symlinks such as
/dev/disk/by-uuid/* or /dev/disk/azure/*.

To allow using them, is_unused_disk() should resolve symlink to
realpath, before evaluating the disk path.

Fixes #11634

Closes #11646
2022-10-28 07:24:11 +03:00
Botond Dénes
b744036840 Merge 'scylla_util.py: on sysconfig_parser, don't use double quote when it's possible' from Takuya ASADA
It seems like distribution original sysconfig files does not use double
quote to set the parameter when the value does not contain space.
Adding function to detect spaces in the value, don't usedouble quote
when it not detected.

Fixes #9149

Closes #9153

* github.com:scylladb/scylladb:
  scylla_util.py: adding unescape for sysconfig_parser
  scylla_util.py: on sysconfig_parser, don't use double quote when it's possible
2022-10-28 07:19:13 +03:00
Benny Halevy
44e1058f63 docs: nodetool/removenode: fix host_id in examples
removenode host_id must specify the host ID as a UUID,
not an ip address.

Fixes #11839

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11840
2022-10-27 14:29:36 +03:00
Pavel Emelyanov
7b193ab0a5 messaging_service: Deny putting INADD_ANY as preferred ip
Even though previous patch makes scylla not gossip this as internal_ip,
an extra sanity check may still be useful. E.g. older versions of scylla
may still do it, or this address can be loaded from system_keyspace.

refs: #11502

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
aa7a759ac9 messaging_service: Toss preferred ip cache management
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace and move the connection reset there. This is mainly to
prepare for the next patch, but also makes the code a bit shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
91b460f1c4 gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
Gossiping 0.0.0.0 as preferred IP may break the peer as it will
"interpret" this address as <myself> which is not what peer expects.
However, g.p.f.s. uses --listen-address argument as the internal IP
and it's not prohibited to configure it to be 0.0.0.0

It's better not to gossip the INTERNAL_IP property at all if the listen
address is such.

fixes: #11502

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Pavel Emelyanov
99579bd186 gossiping_property_file_snitch: Make _listen_address optional
As the preparation for the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:15:26 +03:00
Benny Halevy
0ea8250e83 repair: use sharded abort_source to abort repair_info
Currently we use a single shared_ptr<abort_source>
that can't be copied across shards.

Instead, use a sharded<abort_source> in node_ops_info so that each
repair_info instance will use an (optional) abort_source*
on its own shard.

Added respective start and stop methodsm plus a local_abort_source
getter to get the shard-local abort_source (if available).

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:18:30 +03:00
Benny Halevy
88f993e5ed repair: node_ops_info: add start and stop methods
Prepare for adding a sharded<abort_source> member.

Wire start/stop in storage_service::node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:18:30 +03:00
Benny Halevy
c2f384093d storage_service: node_ops_abort_thread: abort all node ops on shutdown
A later patch adds a sharded<abort_source> to node_ops_info.
On shutdown, we must orderly stop it, so use node_ops_abort_thread
shutdown path (where node_ops_singal_abort is called will a nullopt)
to abort (and stop) all outstanding node_ops by passing
a null_uuid to node_ops_abort, and let it iterate over all
node ops to abort and stop them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:06 +03:00
Benny Halevy
0efd290378 storage_service: node_ops_abort_thread: co_return only after printing log message
Currently the function co_returns if (!uuid_opt)
so the log info message indicating it's stopped
is not printed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
47e4761b4e storage_service: node_ops_meta_data: add start and stop methods
Prepare for starting and stopping repair node_ops_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
5c25066ea7 repair: node_ops_info: prevent accidental copy
Delete node_ops_info copy and move constructors before
we add a sharded<abort_source> member for the per-shard repairs
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Takuya ASADA
cd6030d5df scylla_util.py: adding unescape for sysconfig_parser
Even we have __escape() for escaping " middle of the value to writing
sysconfig file, we didn't unescape for reading from sysconfig file.
So adding __unescape() and call it on get().
2022-10-27 16:39:47 +09:00
Takuya ASADA
de57433bcf scylla_util.py: on sysconfig_parser, don't use double quote when it's possible
It seems like distribution original sysconfig files does not use double
quote to set the parameter when the value does not contain space.
Adding function to detect spaces in the value, don't usedouble quote
when it not detected.

Fixes #9149
2022-10-27 16:36:27 +09:00
Aleksandra Martyniuk
6494de9bb0 tasks: add alternative make_task method
Task manager tasks should be created with make_task method since
it properly sets information about child-parent relationship
between tasks. Though, sometimes we may want to keep additional
task data in classes inheriting from task_manager::task::impl.
Doing it with existing make_task method makes it impossible since
implementation objects are created internally.

The commit adds a new make_task that allows to provide a task
implementation pointer created by caller. All the fields except
for the one connected with children and parent should be set before.
2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
10d11a7baf tasks: rename parent_data to task_info and move it
parent_data struct contains info that is common	for each task,
not only in parent-child relationship context. To use it this way
without confusion, its name is changed to task_info.

In order to be able to widely and comfortably use task_info,
it is moved from tasks/task_manager.hh to tasks/types.hh
and slightly extended.
2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
9ecc2047ac tasks: move task_id to tasks/types.hh 2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
e2e8a286cc tasks: add internal flag for task_manager::task::impl
It is convenient to create many different tasks implementations
representing more and more specific parts of the operation in
a module. Presenting all of them through the api makes it cumbersome
for user to navigate and track, though.

Flag internal is added to task_manager::task::impl so that the tasks
could be filtered before they are sent to user.
2022-10-26 14:01:05 +02:00
Pavel Emelyanov
e245780d56 gossiper: Request topology states in shadow round
When doing shadow round for replacement the bootstrapping node needs to
know the dc/rack info about the node it replaces to configure it on
topology. This topology info is later used by e.g. repair service.

fixes: #11829

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11838
2022-10-25 13:21:20 +03:00
Pavel Emelyanov
64c9359443 storage_proxy: Don't use default-initialized endpoint in get_read_executor()
After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized which is :0 ipv6 address. Later below this
address is used as-is to check if it belongs to the same DC or not which
is not nice, as :0 is not an address of any existing endpoint.

Recent move of dc/rack data onto topology made this place reveal itself
by emitting the internal error due to :0 not being present on the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to "default_dc" datacenter which may or may
not match with the dc of the local node.

The fix is to explicitly tell set extra_replica from unset one.

fixes: #11825

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11833
2022-10-25 09:16:50 +03:00
Takuya ASADA
1a11a38add unified: move unified package contents to sub-directory
On most of the software distribution tar.gz, it has sub-directory to contain
everything, to prevent extract contents to current directory.
We should follow this style on our unified package too.

To do this we need to increment relocatable package version to '3.0'.

Fixes #8349

Closes #8867
2022-10-25 08:58:15 +03:00
Takuya ASADA
a938b009ca scylla_raid_setup: run uuidpath existance check only after mount failed
We added UUID device file existance check on #11399, we expect UUID
device file is created before checking, and we wait for the creation by
"udevadm settle" after "mkfs.xfs".

However, we actually getting error which says UUID device file missing,
it probably means "udevadm settle" doesn't guarantee the device file created,
on some condition.

To avoid the error, use var-lib-scylla.mount to wait for UUID device
file is ready, and run the file existance check when the service is
failed.

Fixes #11617

Closes #11666
2022-10-25 08:54:21 +03:00
Yaniv Kaul
cec21d10ed docs: Fix typo (patch -> batch)
See subject.

Closes #11837
2022-10-25 08:50:44 +03:00
Michał Radwański
36508bf5e9 serializer_impl: remove unneeded generic parameter
Input stream used in vector_deserializer doesn't need to be generic, as
there is only one implementation used.
2022-10-24 17:21:38 +02:00
Tomasz Grabiec
687df05e28 db: make_forwardable::reader: Do not emit range_tombstone_change with position past the range
Since the end bound is exclusive, the end position should be
before_key(), not after_key().

Affects only tests, as far as I know, only there we can get an end
bound which is a clustering row position.

Would cause failures once row cache is switched to v2 representation
because of violated assumptions about positions.

Introduced in 76ee3f029c

Closes #11823
2022-10-24 17:06:52 +03:00
Avi Kivity
9e34779c53 Update seastar submodule
* seastar 601e0776c0...f32ed00954 (28):
  > Merge 'treewide: more fmt 9 adjustments' from Avi Kivity
  > rpc: Remove nested class friend declaration from connection
  > reactor: advance the head pointer in batch
  > Add git submodule instructions to HACKING.md, resolves #541
  > dns: Handle TCP mode connect failure
  > future: s/make_exception_ptr/std::make_exception_ptr/
  > reactor: implement read_some(fd, buffer, len) in io_uring
  > reactor: remove unneeded "protected"
  > Merge 'reactor: support more network ops in io_uring backend' from Kefu Chai
  > reactor: Indentation fix after previous patch
  > io: Remove --max-io-requests concept
  > future: add concept constraints to handle_exception()
  > future: improve the doxygen document
  > aio_general_context: flush: provide 1 second grace for retries
  > reactor: destroy_scheduling_group: make sure scheduling_group is valid
  > reactor: pass a plain pointer to io_uring_wait_cqes()
  > gate: add move ctor and move assignment operator for gate
  > reactor: drop stale comment
  > reactor_config: update stale doc comments
  > test: alloc_test: Actually prevent dead allocation elimination
  > util/closeable: hold _obj with reference_wrapper<>
  > memory: Fix off-by-one in large allocation detection
  > util/closeable: add move ctor for deferred_stop
  > reactor: Remove some unused friend declarations
  > core/sharded.hh: tweak on comment for better readability
  > Merge 'fmt 9 ostream fix' from longlene
  > program_options: allow configure switch-stytle option programmatically
  > inet_address: Add helper to check for address being lo/any

Closes #11814
2022-10-21 21:30:07 +03:00
Botond Dénes
4aa0b16852 Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789

Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793

Closes #11795

* github.com:scylladb/scylladb:
  distributed_loader: populate_column_family: reindent
  distributed_loader: coroutinize populate_column_family
  distributed_loader: table_population_metadata: start: reindent
  distributed_loader: table_population_metadata: coroutinize start_subdir
  distributed_loader: table_population_metadata: start_subdir: reindent
  distributed_loader: pre-load all sstables metadata for table before populating it
2022-10-21 14:07:51 +03:00
Botond Dénes
e981bd4f21 Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.

In this series fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).

Fixes #11801

Closes #11808

* github.com:scylladb/scylladb:
  test/alternator: add test for issue 11801
  MV: fix handling of view update which reassign the same key value
  materialized views: inline used-once and confusing function, replace_entry()
2022-10-21 10:49:28 +03:00
Botond Dénes
396d9e6a46 Merge 'Subscribe repair_info::abort on node_ops_meta_data::abort_source' from Pavel Emelyanov
The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and call .local() on it just crashes.

The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically.

fixes: #10284

Closes #11797

* github.com:scylladb/scylladb:
  repair: Remove ops_uuid
  repair: Remove abort_repair_node_ops() altogether
  repair: Subscribe on node_ops_info::as abortion
  repair: Keep abort source on node_ops_info
  repair: Pass node_ops_info arg to do_sync_data_using_repair()
  repair: Mark repair_info::abort() noexcept
  node_ops: Remove _aborted bit
  node_ops: Simplify construction of node_ops_metadata
  main: Fix message about repair service starting
2022-10-21 10:08:43 +03:00
Avi Kivity
9ebac12e60 test: mutation-test: fix off-by-one in test_large_collection_allocation
The test wants to see that no allocations larger than 128k are present,
but sets the warning threshold to exactly 128k. Due to an off-by-one in
Seastar, this went unnoticed. However, now that the off-by-one in Seastar
is fixed [1], this test starts to fail.

Fix by setting the warning threshold to 128k + 1.

[1] 429efb5086

Closes #11817
2022-10-21 10:04:40 +03:00
Avi Kivity
f0643d1713 alternator: ttl: do not copy mutation while constructing a vector
The vector(initializer_list<T>) constructor copies the T since
initializer_list is read-only. Move the mutation instead.

This happens to fix a use-after-return on clang 15 on aarch64.
I'm fairly sure that's a miscompile, but the fix is worthwhile
regardless.

Closes #11818
2022-10-21 10:04:00 +03:00
Avi Kivity
db79f1eb60 Merge 'cql3: expr: Add unit tests for evaluate()' from Jan Ciołek
This PR adds some unit tests for the `expr::evaluate()` function.

At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request.

I found a few places where I think it would be better to behave in a different way, but nothing serious.

Closes #11815

* github.com:scylladb/scylladb:
  test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
  cql3: expr: Add unit tests for bind_variable validation of collections
  cql3: expr: Add test for subscripted list and map
  cql3: expr: Add test for usertype_constructor
  cql3: expr: Add test for tuple_constructor
  cql3: expr: Add tests for evaluation of collection constructors
  cql3: expr: Add tests for evaluation of column_values and bind_variables
  cql3: expr: Add constant evaluation tests
  test/boost: Add expr_test_utils.hh
  cql3: Add ostream operator for raw_value
  cql3: add is_empty_value() to raw_value and raw_value_view
2022-10-20 22:55:34 +03:00
Jan Ciolek
4c4ed8e6df test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
expr_test_utils.hh was a header file with helper methods for
expression tests. All functions were inline, because I didn't
know how to create and link a .cc file in test/boost.

Now the header is split into expr_test_utils.hh and expr_test_utils.cc
and moved to test/lib, which is designed to keep this kind of files.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 17:31:37 +02:00
Avi Kivity
6ce659be5b Merge "Deglobalize snitch" from Pavel E
"
Snitch was the junction of several services' deps because it was the
holder of endpoint->dc/rack mappings. Now this information is all on
topology object, so snitch can be finally made main-local
"

* 'br-deglobalize-snitch' of https://github.com/xemul/scylla:
  code: Deglobalize snitch
  tests: Get local reference on global snitch instance once
  gossiper: Pass current snitch name into checker
  snitch: Add sharded<snitch_ptr> arg to reset_snitch()
  api: Move update_snitch endpoint
  api: Use local snitch reference
  api: Unset snitch endpoints on stop
  storage_service: Keep local snitch reference
  system_keyspace: Don't use global snitch instance
  snitch: Add const snitch_ptr::operator->()
2022-10-20 16:51:24 +03:00
Avi Kivity
dd0b571d7e Update tools/java submodule (Scylla Cloud serverless config option)
* tools/java 5f2b91d774...87672be28e (1):
  > Add serverless Scylla Cloud config file option
2022-10-20 16:15:28 +03:00
Konstantin Osipov
8c920add42 test: (pytest) fix the pytest wrapper to work on Ubuntu
Ubuntu doesn't have python, only python2 and python3.

Closes #11810
2022-10-20 15:53:24 +03:00
Botond Dénes
669b225c67 reader_permit: resources: remove operator bool and >=
These cannot be meaningfully define for a vector value like resources.
To prevent instinctive misuse, remove them. Operator bool is replaced
with `non_zero()` which hopefully better expresses what to expected.
The comparison operator is just removed and inlined into its own user,
which actually help said user's readability.

Closes #11813
2022-10-20 15:25:11 +03:00
Jan Ciolek
75b27cb61c cql3: expr: Add unit tests for bind_variable validation of collections
evaluating a bind variable should validate
collection values.

Test that bound collection values are validated,
even in case of a nested collection.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
c4651e897f cql3: expr: Add test for subscripted list and map
Test that subscripting lists and maps works as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
5a00c3dd76 cql3: expr: Add test for usertype_constructor
Test that evaluate(usertype_constructor) works
as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
8f6309bd66 cql3: expr: Add test for tuple_constructor
Test that evaluate(tuple_constructor) works
as expected.

It was necessary to implement a custom function
for serializing tuples, because some tests
require the tuple to contain unset_value
or an empty value, which is impossible
to express using the exisiting code.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
5ae719d51a cql3: expr: Add tests for evaluation of collection constructors
Test that evaluate(collection_constructor) works as expected.

Added a bunch of utility methods for creating
collection values to expr_test_utils.hh.

I was forced to write custom serialization of
collections. I tried to use data_value,
but it doesn't allow to express unset_value
and empty values.

The custom serialization isnt actually used
in this specific commit, but it's needed
in the following ones.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:02 +02:00
Pavel Emelyanov
01b1f56bd7 code: Deglobalize snitch
All uses of snitch not have their own local referece. The global
instance can now be replaced with the one living in main (and tests)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:41 +03:00
Pavel Emelyanov
8e4e3f7185 tests: Get local reference on global snitch instance once
Some tests actively use global snitch instance. This patch makes each
test get a local reference and use it everywhere. Next patch will
replace global instance with local one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:40 +03:00
Pavel Emelyanov
898579027d gossiper: Pass current snitch name into checker
Gossiper makes sure local snitch name is the same as the one of other
nodes in the ring. It now gets global snitch to get the name, this patch
passes the name as an argument, because the caller (storage_service) has
snitch instance local reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:38 +03:00
Pavel Emelyanov
1674882220 snitch: Add sharded<snitch_ptr> arg to reset_snitch()
The method replaces snitch instance on the existing sharded<snitch_ptr>
and the "existing" is nowadays the global instance. This patch changes
it to use local reference passed from API code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:34 +03:00
Pavel Emelyanov
5fba0a7f65 api: Move update_snitch endpoint
It's now living in storage_service.cc, but non-global snitch is
available in endpoint_snitch.cc so move the endpoint handler there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:20 +03:00
Pavel Emelyanov
0d49b0e24a api: Use local snitch reference
The snitch/name endpoint needs snitch instance to get the name from.
Also the storage_service/reset_snitch endpoint will also need snitch
instance to call reset on.

This patch carries local snitch reference all thw way through API setup
and patches the get_name() call. The reset_snitch() will come in the
next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:45 +03:00
Pavel Emelyanov
c175ea33e2 api: Unset snitch endpoints on stop
Some time soon snitch API handlers will operate on local snitch
reference capture, so those need to be unset before the target local
variable variable goes away

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:12 +03:00
Pavel Emelyanov
ea8bfc4844 storage_service: Keep local snitch reference
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection

At this point it's worth adding storage_service->snitch explicit
dependency and patch the above to use local reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:30:00 +03:00
Pavel Emelyanov
52d6e56a10 system_keyspace: Don't use global snitch instance
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:29:26 +03:00
Pavel Emelyanov
f524a79fe9 snitch: Add const snitch_ptr::operator->()
To call snitch->something() on const snitch_ptr& variable later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:29:25 +03:00
Nadav Har'El
264f453b9d Merge 'Associate alternator user with its service level configuration' from Piotr Sarna
Until now, authentication in alternator served only two purposes:
 - refusing clients without proper credentials
 - printing user information with logs

After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.

tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user

Fixes #11379

Closes #11380

* github.com:scylladb/scylladb:
  alternator: propagate authenticated user in client state
  client_state: add internal constructor with auth_service
  alternator: pass auth_service and sl_controller to server
2022-10-19 23:27:48 +03:00
Avi Kivity
22f13e7ca3 Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity"
This reverts commit df8e1da8b2, reversing
changes made to 4ff204c028. It causes
a crash in debug mode on aarch64 (likely a coroutine miscompile).

Fixes #11809.
2022-10-19 21:28:55 +03:00
Alexander Turetskiy
636e14cc77 Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693
2022-10-19 19:01:08 +03:00
Avi Kivity
69199dbfba Merge 'schema_tables: limit concurrency' from Benny Halevy
To prevent stalls due to large number of tables.

Fixes scylladb/scylladb#11574

Closes #11689

* github.com:scylladb/scylladb:
  schema_tables: merge_tables_and_views reindent
  schema_tables: limit paralellism
2022-10-19 18:40:45 +03:00
Tomasz Grabiec
a979bbf829 dbuild: Do not fail if .gdbinit is missing
Closes #11811
2022-10-19 18:38:09 +03:00
Avi Kivity
6b0afb968d Merge 'reader_concurrency_semaphore: add set_resources()' from Botond Dénes
Allowing to change the total or initial resources the semaphore has. After calling `set_resources()` the semaphore will look like as if it was created with the specified amount of resources when created.

Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion.

I also plan on using the new facility in enterprise.

Closes #11772

* github.com:scylladb/scylladb:
  replica/database: revert initial boost to system semaphore with set_resources()
  reader_concurrency_semaphore: add set_resources()
2022-10-19 18:04:20 +03:00
Raphael S. Carvalho
ba6186a47f replica: Pick new generation for SSTables being moved from staging dir
When moving a SSTable from staging to base dir, we reused the generation
under the assumption that no SSTable in base dir uses that same
generation. But that's not always true.

When reshaping staging dir, reshape compaction can pick a generation
taken by a SSTable in base dir. That's because staging dir is populated
first and it doesn't have awareness of generations in base dir yet.

When that happens, view building will fail to move SSTable in staging
which shares the same generation as another in base dir.

We could have played with order of population, populating base dir
first than staging dir, but the fragility wouldn't be gone. Not
future proof at all.
We can easily make this safe by picking a new generation for the SSTable
being moved from staging, making sure no clash will ever happen.

Fixes #11789.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11790
2022-10-19 15:33:30 +03:00
Nadav Har'El
2e439c9471 test/alternator: add test for issue 11801
This patch adds a test reproducing issue #11801, and confirming that
the previous patch fixed it. Before the previous patch, the test passed
on DynamoDB but failed on Alternator.
The patch also adds four more passing tests which demonstrate that
issue #11801 only happened in the very specific case where:
 1. A GSI has two key attributes which weren't key attributes in the
    base, and
 2. An update sets the second of those attributes to the same value
    which it already had.

This bug was originally discovered and explained by @fee-mendes.

Refs #11801.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 14:36:48 +03:00
Benny Halevy
4d7f0be929 distributed_loader: populate_column_family: reindent 2022-10-19 14:18:38 +03:00
Benny Halevy
030afaa934 distributed_loader: coroutinize populate_column_family
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:18:04 +03:00
Benny Halevy
0f23ee14c9 distributed_loader: table_population_metadata: start: reindent 2022-10-19 14:16:59 +03:00
Benny Halevy
39cec4f304 distributed_loader: table_population_metadata: coroutinize start_subdir
Calling it in a seastar thread was done to reduce code churn
and facilitate backporting.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:16:59 +03:00
Benny Halevy
5749a54cab distributed_loader: table_population_metadata: start_subdir: reindent 2022-10-19 14:16:59 +03:00
Benny Halevy
119c0f3983 distributed_loader: pre-load all sstables metadata for table before populating it
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Fixes scylladb/scylladb#11793

Note: table_population_metadata::start_subdir is called
in a seastar thread to facilitate backporting to old versions
that do not support coroutines yet.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:16:57 +03:00
Nadav Har'El
8f4243b875 MV: fix handling of view update which reassign the same key value
When a materialized view has a key (in Alternator, this can be two
keys) which was a regular column in the base table, and a base
update modifies that regular column, there are two distinct cases:

1. If the old and new key values are different, we need to delete the
   old view row, and create a new view row (with the different key).

2. If the old and new key values are the same, we just need to update
   the pre-existing row.

It's important not to confuse the two cases: If we try to delete and
create the *same* view row in the same timestamp, the result will be
that the row will be deleted (a tombstone wins over data if they have the
same timestamp) instead of updated. This is what we saw in issue #11801.

We had a bug that was seen when an update set the view key column to
the old value it already had: To compare the old and new key values
we used the function compare_atomic_cell_for_merge(), but this compared
not just they values but also incorrectly compared the metadata such as
a the timestamp. Because setting a column to the same value changes its
timestamp, we wrongly concluded that these to be different view keys
and used the delete-and-create code for this case, resulting in the
view row being deleted (as explained above).

The simple fix is to compare just the key values - not looking at
the metadata.

See tests reproducing this bug and confirming its fix in the next patch.

Fixes #11801

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 13:43:12 +03:00
Nadav Har'El
e1f8cb6521 materialized views: inline used-once and confusing function, replace_entry()
The replace_entry() function is nothing more than a convenience for
calling delete_old_entry() and then create_entry(). But it is only used
once in the code, and we can just open-code the two calls instead of
the one.

The reason I want to change it now is that the shortcut replace_entry()
helped hide a bug (#11801) - replace_entry() works incorrectly if the
old and new row have the same key, because if they do we get a deletion
and creation of the same row with the same timestamp - and the deletion
wins. Having the two calls not hidden by a convenience function makes
this potential problem more apparent.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 13:25:34 +03:00
Benny Halevy
ce22dd4329 schema_tables: merge_tables_and_views reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 13:05:41 +03:00
Benny Halevy
7ccb0e70f0 schema_tables: limit paralellism
To prevent stalls due to large number of tables.

Fixes scylladb/scylladb#11574

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 13:05:38 +03:00
Anna Stuchlik
7ec750fc63 docs: add the list of new metrics in 5.1
Closes #11703
2022-10-19 12:06:25 +03:00
Jan Ciolek
1b7acc758e cql3: expr: Add tests for evaluation of column_values and bind_variables
Add tests which test that evaluate(column_value)
and evaluate(bind_variable) work as expected.

values of columns and bind variables are
kept in evaluation_inputs, so we need to mock
them in order for evaluate() to work.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-19 10:30:51 +02:00
Jan Ciolek
0f29015d9f cql3: expr: Add constant evaluation tests
Add unit test for evaluating expr::constant values.
evaluate(constant) just returns constant.value,
so there is no point in trying all the possible combinations.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-19 10:30:42 +02:00
Anna Stuchlik
a066396cd3 doc: fix the command to create and sign a certificate so that the trusted certificate SHA256 is created
Closes #11758
2022-10-19 11:30:20 +03:00
Botond Dénes
37ebbc819a Merge 'Scylla-gdb lsa polishing' from Pavel Emelyanov
It was supposed to be fix for #2455, but eventually it turned out that #11792 blocks this progress but takes more efforts. So for now only a couple of small improvements (not to lose them by chance)

Closes #11794

* github.com:scylladb/scylladb:
  scylla-gdb: Make regions iterable object
  scylla-gdb: Dont print 0x0x
2022-10-19 06:54:49 +03:00
Botond Dénes
2d581e9e8f Merge "Maintain dc/rack by topology" from Pavel Emelyanov
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.

refs: #2737
refs: #2795
"

* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
  system_keyspace: Dont maintain dc/rack cache
  system_keyspace: Indentation fix after previous patch
  system_keyspace: Coroutinuze build_dc_rack_info()
  topology: Move all post-configuration to topology::config
  snitch: Start early
  gossiper: Do not export system keyspace
  snitch: Remove gossiper reference
  snitch: Mark get_datacenter/_rack methods const
  snitch: Drop some dead dependency knots
  snitch, code: Make get_datacenter() report local dc only
  snitch, code: Make get_rack() report local rack only
  storage_service: Populate pending endpoint in on_alive()
  code: Populate pending locations
  topology: Put local dc/rack on topology early
  topology: Add pending locations collection
  topology: Make get_location() errors more verbose
  token_metadata: Add config, spread everywhere
  token_metadata: Hide token_metadata_impl copy constructor
  gosspier: Remove messaging service getter
  snitch: Get local address to gossip via config
  ...
2022-10-19 06:50:21 +03:00
Jan Ciolek
429600a957 test/boost: Add expr_test_utils.hh
Add a header file which will contain
utilities for writing expression tests.

For now it contains simple functions
like make_int_constant(), but there
are many more to come.

I feel like it's cleaner to put
all these functions in a separate
file instead of having them spread randomly
between tests.

It also enables code reuse so that
future expression tests can reuse
these functions instead of writing
them from scratch.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:48:33 +02:00
Jan Ciolek
855db49306 cql3: Add ostream operator for raw_value
It's possible to print raw_value_view,
but not raw_value. It would be useful
to be able to print both.

Implement printing raw_value
by creating raw_value_view from it
and printing the view.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:48:25 +02:00
Jan Ciolek
096c65d27f cql3: add is_empty_value() to raw_value and raw_value_view
An empty value is a value that is neither null nor unset,
but has 0 bytes of data.
Such values can be created by the user using
certain CQL functions, for example an empty int
value can be inserted using blobasint(0x).

Add a method to raw_value and raw_value_view,
which allows to check whether the value is empty.
This will be used in many places in which
we need to validate that a value isn't empty.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:47:48 +02:00
Pavel Emelyanov
3dc7c33847 repair: Remove ops_uuid
It used to be used to abort repair_info by the corresponding node-ops
uuid, but this code is no longer there, so it's good to drop the uuid as
well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
b835c3573c repair: Remove abort_repair_node_ops() altogether
This code is dead after previous patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
8231b4ec1b repair: Subscribe on node_ops_info::as abortion
When node_ops_meta_data aborts it also kicks repair to find and abort
all relevant repair_infos. Now it can be simplified by subscribing
repair_meta on the abort source and aborting it without explicit kick

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
bf5825daac repair: Keep abort source on node_ops_info
Next patches will need to subscribe on node_ops_meta_data's abort source
inside repair code, so keep the pointer on node_ops_info too. At the
same time, the node_ops_info::abort becomes obsolete, because the same
check can be performed via the abort_source->abort_requested()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
bbb7fca09c repair: Pass node_ops_info arg to do_sync_data_using_repair()
Next patches will need to know more than the ops_uuid. The needed info
is (well -- will be) sitting on node_ops_info, so pass it along

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
5e9c3c65b5 repair: Mark repair_info::abort() noexcept
Next patch will call it inside abort_source subscription callback which
requires the calling code to be noexcept

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
34458ec2c5 node_ops: Remove _aborted bit
A short cleanup "while at it" -- the node_ops_meta_data doesn't need to
carry dedicated _aborted boolean -- the abort source that sets it is
available instantly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:22 +03:00
Pavel Emelyanov
96f0695731 node_ops: Simplify construction of node_ops_metadata
It always constructs node_ops_info the same way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:03:53 +03:00
Pavel Emelyanov
2fa58632b3 main: Fix message about repair service starting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 17:23:17 +03:00
Botond Dénes
7fbad8de87 reader_concurrency_semaphore: unify admission logic across all paths
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods which admits permits on user
request (the front door) and the maybe_admit_waiters() which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged in time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.

Fixes: #11770

Closes #11784
2022-10-18 17:07:43 +03:00
Pavel Emelyanov
b5fd65af61 scylla-gdb: Make regions iterable object
This makes it re-usable across different commands (not there yet)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 16:09:47 +03:00
Pavel Emelyanov
0b6b0bd8d2 scylla-gdb: Dont print 0x0x
Formatting pointer adds 0x automatically, no need in adding it
explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 16:09:09 +03:00
Botond Dénes
df8e1da8b2 Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity
indexed_table_select_statement::do_execute_base_query() is fairly
complicated and becomes a little simpler with coroutines.

Closes #11297

* github.com:scylladb/scylladb:
  cql3: indexed_table_select_statement: fix indentation
  cql3: indexed_table_select_statement: clarify loop termination
  cql3: indexed_table_select_statement: get rid of internal base_query_state struct
  cql3: indexed_table_select_statement: coroutinize do_execute_base_query()
  cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query()
2022-10-18 08:24:21 +03:00
Avi Kivity
50b1fd4cd2 cql3: indexed_table_select_statement: fix indentation
Restore normal indentation after coroutinization, no code changes.
2022-10-17 22:03:11 +03:00
Avi Kivity
3ad956ca2d cql3: indexed_table_select_statement: clarify loop termination
The loop terminates when we run out of keys. There are extra
conditions such as for short read or page limit, but these are
truly discovered during the loop and qualify as special
conditions, if you squint enough.
2022-10-17 22:03:11 +03:00
Avi Kivity
ec183d4673 cql3: indexed_table_select_statement: get rid of internal base_query_state struct
It was just a crutch for do_with(), and now can be replaced with
ordinary coroutine-protected variables. The member names were renamed
to the final names they were assigned within the do_with().
2022-10-17 22:03:11 +03:00
Avi Kivity
75e1321b08 cql3: indexed_table_select_statement: coroutinize do_execute_base_query()
Indentation and "infinite" for-loop left for later cleanup.

Note the last check for a utils::result<> failure is no longer needed,
since the previous checks for failure resulted in an immediate co_return
rather than propagating the failure into a variable as with continuations.

The lambda coroutine is stabilized with the new seastar::coroutine::lambda
facility.
2022-10-17 22:03:11 +03:00
Avi Kivity
8b019841d8 cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query()
It's an obstacle to coroutinization as it introduces more lambdas.
2022-10-17 22:03:11 +03:00
Tomasz Grabiec
4ff204c028 Merge 'cache: make all removals of cache items explicit' from Michał Chojnowski
This series is a step towards non-LRU cache algorithms.

Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.

However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.

This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.

Closes #11716

* github.com:scylladb/scylladb:
  utils: lru: assert that evictables are unlinked before destruction
  utils: lru: remove unlink_from_lru()
  cache: make all cache unlinks explicit
2022-10-17 12:47:02 +02:00
Michał Chojnowski
a96433d3a4 utils: lru: assert that evictables are unlinked before destruction
Previous patches introduce the assumption that evictables are manually unlinked
before destruction, to allow for correct bookkeeping within the cache.
This assert assures that this assumptions is correct.

This is particularly important because the switch from automatic to explicit
unlinking had to be done manually. Destructor calls are invisible, so it's
possible that we have missed some automatic destruction site.
2022-10-17 12:07:27 +02:00
Michał Chojnowski
f340c9cca5 utils: lru: remove unlink_from_lru()
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
2022-10-17 12:07:27 +02:00
Michał Chojnowski
d785364375 cache: make all cache unlinks explicit
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.

However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.

In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
2022-10-17 12:07:27 +02:00
Nadav Har'El
c31bf4184f test/cql-pytest: two reproducers for SI returning oversized pages
This patch has two reproducing tests for issue #7432, which are cases
where a paged query with a restriction backed by a secondary-index
returns pages larger than the desired page size. Because these tests
reproduce a still-open bug, they are both marked "xfail". Both tests
pass on Cassandra.

The two tests involve quite dissimilar casess - one involves requesting
an entire partition (and Scylla forgetting to page through it), and the
other involves GROUP BY - so I am not sure these two bugs even have the
same underlying cause. But they were both reported in #7432, so let's
have reproducers for both.

Refs #7432

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11586
2022-10-17 11:36:05 +03:00
Botond Dénes
d85208a574 replica/database: revert initial boost to system semaphore with set_resources()
Unlike the current method (which uses consume()), this will also adjust the
initial resources, adjusting the semaphore as if it was created with the
reduced amount of resources in the first place. This fixes the confusing
90/100 count resources seen in diagnostics dump outputs.
2022-10-17 07:39:20 +03:00
Botond Dénes
ecc7c72acd reader_concurrency_semaphore: add set_resources()
Allowing to change the total or initial resources the semaphore has.
After calling `set_resources()` the semaphore will look like as if it
was created with the specified amount of resources when created.
2022-10-17 07:39:20 +03:00
Avi Kivity
e5e7780f32 test: work around modern pytest rejecting site-packages
Modern (as of Fedora 37) pytest has the "-sP" flags in the Python command
line, as found in /usr/bin/pytest. This means it will reject the
site-packages directory, where we install the Scylla Python driver. This
causes all the tests to fail.

Work around it by supplying an alternative pytest script that does not
have this change.

Closes #11764
2022-10-17 07:18:33 +03:00
Nadav Har'El
9f02431064 test/cql-pytest: fix test_permissions.py when running with "--ssl"
The tests in test_permissions.py use the new_session() utility function
to create a new connection with a different logged-in user.

It models the new connection on the existing one, but incorrectly
assumed that the connection is NOT ssl. This made this test failed
with cql-pytest/run is passed the "--ssl" option.

In this patch we correctly infer the is_ssl state from the existing
cql fixture, instead of assuming it is false. After this pass,
"cql-pytest/run --ssl" works as expected for this test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11742
2022-10-17 06:46:46 +03:00
Tomasz Grabiec
c8a372ae7f test: db: Add test for row merging involving many versions
The test verifies that a row which participated in earlier merge, and
its cells lost on the timestamp check, behaves exactly like an empty
row and can accept any mutation.

This wasn't the case in versions prior to f006acc.

Closes #11787
2022-10-16 14:29:49 +03:00
Tomasz Grabiec
5d7e40af99 mvcc: Add snapshot details to the printout of partition_entry
Useful for debugging.

Closes #11788
2022-10-16 14:22:14 +03:00
Nadav Har'El
d2cd9b71b3 Merge 'Make tracing test run again, simplify backend registry and few related cleanups' from Pavel Emelyanov
It turned out that boost/tracing test is not run because its name doesn't match the *_test.cc pattern. While fixing it it turned out that the test cannot even start, because it uses future<>.get() calls outside of seastar::thread context. While patching this place the trace-backend registry was removed for simplicity. And, while at it, few more cleanups "while at it"

Closes #11779

* github.com:scylladb/scylladb:
  tracing: Wire tracing test back
  tracing: Indentation fix after previous patch
  tracing: Move test into thread
  tracing: Dismantle trace-backend registry
  tracing: Use class-registrator for backends
  tracing: Add constraint to trace_state::begin()
  tracing: Remove copy-n-paste comments from test
  tracing: Outline may_create_new_session
2022-10-16 12:32:17 +03:00
Nadav Har'El
1f936838ba Merge 'doc: fix the notes on the OS Support by Platform and Version page' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11773

This PR fixes the notes by removing repetition and improving the clairy of the notes on the OS Support page.
In addition, "Scylla" was replaced with "ScyllaDB" on related pages.

Closes #11783

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB
  doc: add a comment to remove in future versions any information that refers to previous releases
  doc: rewrite the notes to improve clarity
  doc: remove the reperitions from the notes
2022-10-16 10:13:50 +03:00
Tomasz Grabiec
87b7e7ff9c Merge 'storage_proxy: prepare for fencing, complex ops' from Avi Kivity
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
 - partition scan
 - counter mutation
 - paxos

With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.

Closes #11296

* github.com:scylladb/scylladb:
  storage_proxy: hold effective_replication_map for the duration of a paxos transaction
  storage_proxy: move paxos_response_handler class to .cc file
  storage_proxy: deinline paxos_response_handler constructor/destructor
  storage_proxy: use consistent effective_replication_map for counter coordinator
  storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
  storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
  storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
  storage_proxy: query_singular: use fewer smart pointers
  storage_proxy: query_singular: simplify lambda captures
  locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
  storage_proxy: use consistent token_metadata with rest of singular read
2022-10-14 15:44:35 +02:00
Pavel Emelyanov
6150214da3 Add rust/Cargo.lock to .gitignore
The file appears after build

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11776
2022-10-14 13:54:50 +03:00
Anna Stuchlik
09b0e3f63e doc: replace Scylla with ScyllaDB 2022-10-14 11:06:27 +02:00
Anna Stuchlik
9e2b7e81d3 doc: add a comment to remove in future versions any information that refers to previous releases 2022-10-14 10:53:17 +02:00
Anna Stuchlik
fc0308fe30 doc: rewrite the notes to improve clarity 2022-10-14 10:48:59 +02:00
Anna Stuchlik
1bd0bc00b3 doc: remove the reperitions from the notes 2022-10-14 10:32:52 +02:00
Botond Dénes
621e43a0c8 Merge 'dirty_memory_manager: tidy up' from Avi Kivity
A collection of small cleanups, and a bug fix.

Closes #11750

* github.com:scylladb/scylladb:
  dirty_memory_manager: move region_group data members to top-of-class
  dirty_memory_manager: update region_group comment
  dirty_memory_manager: remove outdated friend
  dirty_memory_manager: fold region_group::push_back() into its caller
  dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available
  dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available
  dirty_memory_manager: tidy up region_group::execution_permitted()
  dirty_memory_manager: reindent region_group::release_queued_allocations()
  dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine
  dirty_memory_manager: move region_group::_releaser after _shutdown_requested
  dirty_memory_manager: move region_group queued allocation releasing into a function
  dirty_memory_manager: fold allocation_queue into region_group
  dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()
2022-10-14 06:56:42 +03:00
Avi Kivity
1feaa2dfb4 storage_proxy: handle_write: use coroutine::all() instead of when_all()
coroutine::all() saves an allocation. Since it's safe for lambda
coroutines, remove a coroutine::lambda wrapper.

Closes #11749
2022-10-14 06:56:16 +03:00
Tomasz Grabiec
ee2398960c Merge 'service/raft: simplify raft_address_map' from Kamil Braun
The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.

In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system.

The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.

Closes #11763

* github.com:scylladb/scylladb:
  service/raft: raft_address_map: get rid of `is_linked` checks
  service/raft: raft_address_map: get rid of `to_list_iterator`
  service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
  service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
  service/raft: raft_address_map: don't use intrusive set for timestamped entries
  service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
2022-10-13 18:08:49 +02:00
Kamil Braun
954849799d test/topology: disable flaky test_decommission_add_column
Flaky due to #11780, causes next promotion failures.
We can reenable it after the issue is fixed or a workaround is found.
2022-10-13 17:45:46 +02:00
Pavel Emelyanov
707efb6dfb tracing: Wire tracing test back
The boost/tracing test is not run, because test.py boost suite collects
tests that match *_test.cc pattern. The tracing one apparently doesn't

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:59:13 +03:00
Pavel Emelyanov
5b67a2a876 tracing: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:59:08 +03:00
Pavel Emelyanov
53ac8536f1 tracing: Move test into thread
The test calls future<>.get()'s in its lambda which is only allowed in
seastar threads. It's not stepped upon because (surprise, surprise) this
test is not run at all. Next patch fixes it.

Meanwhile, the fix is in using cql_env_thread thing for the whole lambda
which runs in it seastar::async() context

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:57:35 +03:00
Pavel Emelyanov
5c8a61ace2 tracing: Dismantle trace-backend registry
It's not used any longer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:57:24 +03:00
Pavel Emelyanov
fe7d38661c tracing: Use class-registrator for backends
Currently the code uses its own class registration engine, but there's a
generic one in utils/ that applies here too. In fact, the tracing
backend registry is just a transparent wrapper over the generic one :\

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:56:24 +03:00
Pavel Emelyanov
1adb2c8cc3 tracing: Add constraint to trace_state::begin()
It expects that the function is (void) and returns back a string

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:56:08 +03:00
Pavel Emelyanov
0a6a5a242e tracing: Remove copy-n-paste comments from test
Tests don't have supervisor, so there's no sense in keeping these bits

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:55:40 +03:00
Pavel Emelyanov
79820c2006 tracing: Outline may_create_new_session
It's a private method used purely in tracing.cc, no need in compiling it
every time the header is met somewhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:55:14 +03:00
Anna Stuchlik
9f7536d549 doc: fix the link to the OS Support page 2022-10-13 15:36:51 +02:00
Anna Stuchlik
1fd1ce042a doc: replace Scylla with ScyllaDB 2022-10-13 15:21:46 +02:00
Anna Stuchlik
81ce7a88de doc: update the info about supported architecture and rewrite the introduction 2022-10-13 15:18:29 +02:00
Kamil Braun
5a9371bcb0 service/raft: raft_address_map: get rid of is_linked checks
Being linked is an invariant of `expiring_entry_ptr`. Make it explicit
by moving the `_expiring_list.push_front` call into the constructor.
2022-10-13 15:17:07 +02:00
Kamil Braun
cdf3367c05 service/raft: raft_address_map: get rid of to_list_iterator
Unnecessary.
2022-10-13 15:17:06 +02:00
Kamil Braun
0e29495c38 service/raft: raft_address_map: simplify ownership of expiring_entry_ptr
The owner of `expiring_entry_ptr` was almost uniquely its corresponding
`timestamp_entry`; it would delete the expiring entry when it itself got
destroyed. There was one call to explicit `unlink_and_dispose`, which
made the picture unclear.

Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.

We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.

The code becomes shorter and simpler.
2022-10-13 15:16:40 +02:00
Petr Gusev
c76cf5956d removenode: don't stream data from the leaving node
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.

In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.

Closes #11704
2022-10-13 15:11:32 +02:00
Takuya ASADA
49d5e51d76 reloc: add support stripped binary installation for relocatable package
This add support stripped binary installation for relocatable package.
After this change, scylla and unified packages only contain stripped binary,
and introduce "scylla-debuginfo" package for debug symbol.
On scylla-debuginfo package, install.sh script will extract debug symbol
at /opt/scylladb/<dir>/.debug.

Note that we need to keep unstripped version of relocatable package for rpm/deb,
otherwise rpmbuild/debuild fails to create debug symbol package.
This version is renamed to scylla-unstripped-$version-$release.$arch.tar.gz.

See #8918

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9005
2022-10-13 15:11:32 +02:00
Asias He
6134fe4d1f storage_service: Prevent removed node to rejoin in handle_state_normal
- Start n1, n2, n3 (127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4

The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.

This patch prevents this by checking the status for the host id on all
the entries. If any of the entries shows the node with the host id is in
LEFT status, reject to put the node in NORMAL status.

Fixes #11355

Closes #11361
2022-10-13 15:11:32 +02:00
Jan Ciolek
52bbc1065c cql3: allow lists of IN elements to be NULL
Requests like `col IN NULL` used to cause
an error - Invalid null value for colum col.

We would like to allow NULLs everywhere.
When a NULL occurs on either side
of a binary operator, the whole operation
should just evaluate to NULL.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #11775
2022-10-13 15:11:32 +02:00
Avi Kivity
19e62d4704 commitlog: delete unused "num_deleted" variable
Since d478896d46 we update the variable, but never read it.
Clang 15 notices and complains. Remove the variable to make it
happy.

Closes #11765
2022-10-13 15:11:32 +02:00
Avi Kivity
a2da08f9f9 storage_proxy: hold effective_replication_map for the duration of a paxos transaction
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is it hold the effective_replication_map for the
duration of the transaction, and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
2022-10-13 14:27:26 +03:00
Avi Kivity
69aaa5e131 storage_proxy: move paxos_response_handler class to .cc file
It's not used elsewhere.
2022-10-13 14:27:26 +03:00
Avi Kivity
b2f3934e95 storage_proxy: deinline paxos_response_handler constructor/destructor
They have no business being inline as it's a heavyweight object.
2022-10-13 14:27:26 +03:00
Avi Kivity
94e4ff11be storage_proxy: use consistent effective_replication_map for counter coordinator
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.

The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
2022-10-13 14:27:23 +03:00
Avi Kivity
406a046974 storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.

Fix by capturing it once in the caller and propagating it everywhere.
2022-10-13 13:56:52 +03:00
Avi Kivity
5d320e95d5 storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
Capture token_metadata by reference rather than smart pointer, since
out effective_replication_map_ptr protects it.
2022-10-13 13:56:52 +03:00
Avi Kivity
f75efa965f storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
2022-10-13 13:56:52 +03:00
Avi Kivity
161ce4b34f storage_proxy: query_singular: use fewer smart pointers
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
2022-10-13 13:56:33 +03:00
Avi Kivity
efd89c1890 storage_proxy: query_singular: simplify lambda captures
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambda, since there's one thing less to
worry about.
2022-10-13 13:52:54 +03:00
Avi Kivity
d9955ab35b locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
token_metadata is protected by holders of an effective_replication_map_ptr,
so it's just as safe and less expensive for them to obtain a reference
to token_metadata rather than a smart pointer, so give them that option with
a new accessor.
2022-10-13 13:46:04 +03:00
Avi Kivity
86a48cf12f storage_proxy: use consistent token_metadata with rest of singular read
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
2022-10-13 13:46:04 +03:00
Avi Kivity
720fc733f0 dirty_memory_manager: move region_group data members to top-of-class
Rather than have them spread out throughout the class.
2022-10-13 13:12:01 +03:00
Avi Kivity
61b780ae63 dirty_memory_manager: update region_group comment
It's still named region_group. I may merge the whole thing into
dirty_memory_manager to retire the name.
2022-10-13 13:09:01 +03:00
Avi Kivity
7a5fa1497c dirty_memory_manager: remove outdated friend
That friend no longer exists.
2022-10-13 13:03:43 +03:00
Avi Kivity
02b7697051 dirty_memory_manager: fold region_group::push_back() into its caller
It is too trivial to live.
2022-10-13 13:03:43 +03:00
Avi Kivity
d403ecbed9 dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available
- apply De Morgan's law
 - merge if block into boolean calculation
2022-10-13 13:03:43 +03:00
Avi Kivity
cb6c7023c1 dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available 2022-10-13 13:03:43 +03:00
Avi Kivity
39668d5ae2 dirty_memory_manager: tidy up region_group::execution_permitted()
- remove excess parentheses
 - apply De Morgan's law
 - remove unneeded this->
 - whitespace cleanups
2022-10-13 13:03:43 +03:00
Avi Kivity
02706e78f9 dirty_memory_manager: reindent region_group::release_queued_allocations() 2022-10-13 13:03:43 +03:00
Avi Kivity
128f1c8c21 dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine
Nicer and faster. We have a rare case where we hold a lock for the duration
of a call but we don't want to hold it until the future it returns is
resolved, so we have to resort to a minor trick.
2022-10-13 13:03:29 +03:00
Avi Kivity
aad4c1c5e9 dirty_memory_manager: move region_group::_releaser after _shutdown_requested
The function that is attached to _releaser depends on
_shutdown_requested. There is currently now use-before-init, since
the function (release_queued_allocations) starts with a yield(),
moving the first use to until after the initialization.

Since I want to get rid of the yield, reorder the fields so that
they are initialized in the right order.
2022-10-13 13:00:50 +03:00
Raphael S. Carvalho
ec79ac46c9 db/view: Add visibility to view updating of Staging SSTables
Today, we're completely blind about the progress of view updating
on Staging files. We don't know how long it will take, nor how much
progress we've made.

This patch adds visibility with a new metric that will inform
the number of bytes to be processed from Staging files.
Before any work is done, the metric tell us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume them.

We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11751
2022-10-12 16:57:37 +03:00
Avi Kivity
2e79bb431c tools: change source_location location
std::experimental::source_location is provided by <experimental/source_location>,
not <source_location>. libstdc++ 12 insists, so change the header.

Closes #11766
2022-10-12 15:29:14 +03:00
Takuya ASADA
6b246dc119 locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service
EC2 instance metadata service can be busy, ret's retry to connect with
interval, just like we do in scylla-machine-image.

Fixes #10250

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #11688
2022-10-12 13:59:06 +03:00
Kamil Braun
92dd1f7307 service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
`timestamped_entry` had two fields:

```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```

The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately after a corresponding `expiring_entry_ptr` was constructed
which again reset the `_last_accessed` field and set `_lru_entry`.

The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
2022-10-12 12:22:57 +02:00
Kamil Braun
262b9473d5 service/raft: raft_address_map: don't use intrusive set for timestamped entries
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change) but we want for the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)

However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.

The code becomes shorter with this change.

I also use a map instead of a set because we need to modify the
`timestamped_entry` which wouldn't be possible if it was used as an
`unordered_set` key. In fact using map here makes more sense: we were
using the intrusive set similarly to a map anyway because all lookups
were performed using the `_id` field of `timestamped_entry` (now the
field was moved outside the struct, it's used as the map's key).
2022-10-12 12:22:50 +02:00
Kamil Braun
3e84b1f69c Merge 'test.py: topology fix ssl var and improve pylint score' from Alecco
When code was moved to the new directory, a bug was reintroduced with `ssl` local hiding `ssl` module. Fix again.

Closes #11755

* github.com:scylladb/scylladb:
  test.py: improve pylint score for conftest
  test.py: fix variable name collision with ssl
2022-10-12 11:41:11 +02:00
Avi Kivity
f673d0abbe build: support fmt 9 ostream formatter deprecation
fmt 9 deprecates automatic fallback to std::ostream formatting.
We should migrate, but in order to do so incrementally, first enable
the deprecated fallback so the code continues to compile.

Closes #11768
2022-10-12 09:27:36 +03:00
Avi Kivity
0952cecfc9 build: mark abseil as a system header
Abseil is not under our control, so if a header generates a
warning, we can do nothing about it. So far this wasn't a problem,
but under clang 15 it spews a harmless deprecation warning. Silence
the warning by treating the header as a system header (which it is,
for us).

Closes #11767
2022-10-12 09:27:36 +03:00
Kamil Braun
0c13c85752 service/raft: raft_address_map: store reference to timestamped_entry in expiring_entry_ptr
The class was storing a pointer which couldn't be null.
A reference is a better fit in this case.
2022-10-11 17:21:01 +02:00
Asias He
810b424a8c storage_service: Reject to bootstrap new node when node has unknown gossip status
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 as shutdown
- Add n4

Node n4 will learn the ip and uuid of n3 but it does not know the gossip
status of n3 since gossip status is published only by the node itself.
After full cluster shutdown, gossip status of n3 will not be present
until n3 is restarted again. So n4 will not think n3 is part of the
ring.

In this case, it is better to reject the bootstrap.

With this patch, one would see the following when adding n4:

```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```

The user needs to perform either of the following before adding a new node:

1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster

Fixes #6088

Closes #11425
2022-10-11 15:47:34 +03:00
Botond Dénes
378c6aeebd Merge 'More Raft upgrade tests' from Kamil Braun
Refactor the existing upgrade tests, extracting some common functionality to
helper functions.

Add more tests. They are checking the upgrade procedure and recovery from
failure in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.

Add some new functionalities to `ScyllaRESTAPIClient` like injecting errors and
obtaining gossip generation numbers.

Extend the removenode function to allow ignoring dead nodes.

Improve checking for CQL availability when starting nodes to speed up testing.

Closes #11725

* github.com:scylladb/scylladb:
  test/topology_raft_disabled: more Raft upgrade tests
  test/topology_raft_disabled: refactor `test_raft_upgrade`
  test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
  test/pylib: rest_client: propagate errors from put_json
  test/pylib: fix some type hints
  test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
2022-10-11 15:30:00 +03:00
Kamil Braun
08e654abf5 Merge 'raft: (service) cleanups on the path for dynamic IP address support' from Konstantin Osipov
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
   This will provide a single extension point when it's necessary to
   prompt raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
   discovery won't store raft::server_address. On the same token stop
   using discovery::peer_set anywhere outside discovery (for persistence),
   use a peer_list instead, which is easier to marshal.

Closes #11676

* github.com:scylladb/scylladb:
  raft: (discovery) do not use raft::server_address to carry IP data
  raft: (group0) API refactoring to avoid raft::server_address
  raft: rename group0_upgrade.hh to group0_fwd.hh
  raft: (group0) move the code around
  raft: (discovery) persist a list of discovered peers, not a set
  raft: (group0) always start group0 using start_server_for_group0()
2022-10-11 13:43:41 +02:00
Asias He
58c65954b8 storage_service: Reject decommission if nodes are down
- Start n1, n2, n3

- Apply network nemesis as below:
  + Block gossip traffic going from nodes 1 and 2 to node 3.
  + All the other rpc traffic flows normally, including gossip traffic
    from node 3 to nodes 1 and 2 and responses to node_ops commands from
    nodes 1 and 2 to node 3.

- Decommission n3

Currently, the decommission will be successful because all the network
traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n3 could not move the n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL forever.

I know why the node stays DL forever. The problem is that with
node_ops_cmd based node operation, we still rely on the gossip status of
STATUS_LEFT from the node being decommissioned to notify other nodes
this node has finished decommission and can be moved from STATUS_LEAVING
to STATUS_LEFT.

This patch fixes by checking gossip liveness before running
decommission. Reject if required peer nodes are down.

With the fix, the decommission of n3 will fail like this:

$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})

Fixes #11302

Closes #11362
2022-10-11 14:09:28 +03:00
Botond Dénes
917fdb9e53 Merge "Cut database-system_keyspace circular dependency" from Pavel Emelyanov
"
There's one via the database's compaction manager and large data handler
sub-services. Both need system keyspace to put their info into, but the
latter needs database naturally via query_processor->storage_proxy link.

The solution is to make c.m. | l.d.h. -> sys.ks. dependency be weak with
the help of shared_from_this(), described in details in patch #2 commit
message.

As a (not-that-)side effect this set removes a bunch of global qctx
calls.

refs: #11684 (this set seem to increase the chance of stepping on it)
"

* 'br-sysks-async-users' of https://github.com/xemul/scylla:
  large_data_handler: Use local system_keyspace to update entries
  system_keyspace: De-static compaction history update
  compaction_manager: Relax history paths
  database: Plug/unplug system_keyspace
  system_keyspace: Add .shutdown() method
2022-10-11 08:52:04 +03:00
Nadav Har'El
ef0da14d6f test/cql-pytest: add simple tests for USE statement
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.

Beyond testing these specific feature, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11741
2022-10-11 08:20:19 +03:00
Kamil Braun
df2fb21972 test/topology: reenable test_remove_node_add_column
After #11691 was merged the test should no longer be flaky. Reenable it.

Closes #11754
2022-10-11 08:18:20 +03:00
Pavel Emelyanov
8b8b37cdda system_keyspace: Dont maintain dc/rack cache
Some good news finally. The saved dc/rack info about the ring is now
only loaded once on start. So the whole cache is not needed and the
loading code in storage_service can be greatly simplified

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
775f42c8d1 system_keyspace: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
8f1df240c7 system_keyspace: Coroutinuze build_dc_rack_info()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
b6061bb97d topology: Move all post-configuration to topology::config
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
56d4863eb6 snitch: Start early
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
16188a261e gossiper: Do not export system keyspace
No users of it left. Despite the gossiper->system_keyspace dependency is
not needed either, keep it alive because gossiper still updates system
keyspace with feature masks, so chances are it will be reactivated some
time later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
2bb354b2e7 snitch: Remove gossiper reference
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
26f9472f21 snitch: Mark get_datacenter/_rack methods const
They are in fact such, but wasn't marked as const before because they
wanted to talk to non-const gossiper and system_keyspaces methods and
updated snitch internal caches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
e9bd912f79 snitch: Drop some dead dependency knots
After previous patches and merged branches snitch no longer needs its
method that gets dc/rack for endpoints from gossiper, system keyspace
and its internal caches.

This cuts the last but the biggest snitch->gossiper dependency.
Also this removes implicit snitch->system_keyspace dependency loop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
4206b1f98f snitch, code: Make get_datacenter() report local dc only
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
6c6711404f snitch, code: Make get_rack() report local rack only
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
bc813771e8 storage_service: Populate pending endpoint in on_alive()
A special-purpose add-on to the previous patch.

When messaging service accepts a new connection it sometimes may want to
drop it early based on whether the client is from the same dc/rack or
not. However, at this stage the information might have not yet had
chances to be spread via storage service pending-tokens updating paths,
so here's one more place -- the on_alive() callback

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
1be97a0a76 code: Populate pending locations
Previous patches added the concept of pending endpoints in the topology,
this patch populates endpoints in this state.

Also, the set_pending_ranges() is patched to make sure that the tokens
added for the enpoint(s) are added for something that's known by the
topology. Same check exists in update_normal_tokens()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
b61bd6cf56 topology: Put local dc/rack on topology early
Startup code needs to know the dc/rack of the local node early, way
before nodes starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
da75552e1f topology: Add pending locations collection
Nowadays the topology object only keeps info about nodes that are normal
members of the ring. Nodes that are joining or bootstrapping or leaving
are out of it. However, one of the goals of this patchset is to make
topology object provide dc/rack info for _all_ nodes, even those in
transitive state.

The introduced _pending_locations is about to hold the dc/rack info for
transitive endpoints. When a node becomes member of the ring it is moved
from pending (if it's there) to current locations, when it leaves the
ring it's moved back to pending.

For now the new collection is just added and the add/remove/get API is
extended to maintain it, but it's not really populated. It will come in
the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
fa613285e7 topology: Make get_location() errors more verbose
Currently if topology.get_location() doesn't find an entry in its
collection(s) it throws standard out-of-range exception which's very
hard to debug.

Also, next patches will extend this method, the introduced here if
(_current_locations.contains()) makes this future change look nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
d60ebc5ace token_metadata: Add config, spread everywhere
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
7c211e8e50 token_metadata: Hide token_metadata_impl copy constructor
Copying of token_metadata_impl is heavy operation and it's performed
internally with the help of the dedicated clone_async() method. This
method, in turn, doesn't copy the whole object in its copy-ctor, but
rather default-initializes it and carries the remaining fields later.

Having said that, the standart copy-ctor is better to be made private
and, for the sake of being more explicit, marked as shallow-copy-ctor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
072ef88ed1 gosspier: Remove messaging service getter
No code needs to borrow messaging from gossiper, which is nice

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
66bc84d217 snitch: Get local address to gossip via config
The property-file snitch gossips listen_address as internal-IP state. To
get this value it gets it from snitch->gossiper->messaging_service
chain. This change provides the needed value via config thus cutting yet
another snitch->gossiper dependency and allowing gossiper not to export
messaging service in the future

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
77bde21024 storage_service: Shuffle on_alive() callback
No functional changes, just keep some conditions from if()s as local
variables. This is the churn-reducing preparation for one of the the
next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
583204972e api: Don't report dc/rack for endpoints not in ring
When an endpoint is not in ring the snitch/get_{rack|datacenter} API
still return back some value. The value is, in fact, the default one,
because this is how snitch resolves it -- when it cannot find a node in
gossiper and system keyspace it just returns defaults.

When this happens the API should better return some error (bad param?)
but there's a bug in nodetool -- when the 'status' command collects info
about the ring it first collects the endpoints, then gets status for
each. If between getting an endpoint and getting its status the endpoint
disappears, the API would fail, but nodetool doesn't handle it.

Next patches will make .get_rack/_dc calls use in-topology collections
that don't fall-back to default values if the entry is not found in it,
so prepare the API in advance to return back defaults.

refs: #11706

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:12:47 +03:00
Konstantin Osipov
3e46c32d7b raft: (discovery) do not use raft::server_address to carry IP data
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.

Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.

So switch to an own data structure for the purposes of discovery,
discovery_peer, which contains a pair ip, raft server id.

Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
2022-10-10 16:24:33 +03:00
Pavel Emelyanov
b1f4273f0d large_data_handler: Use local system_keyspace to update entries
The l._d._h.'s way to update system keyspace is not like in other code.
Instead of a dedicated helper on the system_keyspace's side it executes
the insertion query directly with the help of qctx.

Now when the l._d._h. has the weak system keyspace reference it can
execute queries on _it_ rather than on the qctx.

Just like in previous patch, it needs to keep the sys._k.s. weak
reference alive until the query's future resolves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
907fd2d355 system_keyspace: De-static compaction history update
Compaction manager now has the weak reference on the system keyspace
object and can use it to update its stats. It only needs to take care
and keep the shared pointer until the respective future resolves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
3e0b61d707 compaction_manager: Relax history paths
There's a virtual method on table_state to update the entry in system
keyspace. It's an overkill to facilitate tests that don't want this.
With new system_keyspace weak referencing it can be made simpled by
moving the updating call to the compaction_manager itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
f9b57df471 database: Plug/unplug system_keyspace
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execula local requests via
query_processor. The latter needs the former via compaction manager and
large data handler, database depends on both and these too need to
insert their entries into system keyspace.

To cut this loop the compaction manager and large data handler both get
a weak reference on the system keysace. Once system keyspace starts is
activcates this reference via the database call. When system keyspace is
shutdown-ed on stop, it deactivates the reference.

Technically the weak reference is implemented by marking the system_k.s.
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When compaction manager or large data
handler need to update a system keyspace's table, they both hold an
extra reference on the system keyspace until the entry is committed,
thus making sure that sys._k.s. doesn't stop from under their feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entries update will appear and the system_k.s. will eventually be
released.

It's not a C++ classical reference, because system_keyspace starts after
and stops before database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Konstantin Osipov
8857e017c7 raft: (group0) API refactoring to avoid raft::server_address
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.

These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.

This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
224dd9ce1e raft: rename group0_upgrade.hh to group0_fwd.hh
The plan is to add other group-0-related forward declarations
to this file, not just the ones for upgrade.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
e226624daf raft: (group0) move the code around
Move load/store functions for discovered peers up,
since going forward they'll be used to in start_server_for_group0(),
to extend the address map prior to start (and thus speed up
bootstrap).
2022-10-10 15:58:48 +03:00
Konstantin Osipov
199b6d6705 raft: (discovery) persist a list of discovered peers, not a set
We plan to reuse the discovery table to store the peers
after discovery is over, so load/store API must be generalized
to use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
746322b740 raft: (group0) always start group0 using start_server_for_group0()
When IP addresses are removed from raft::configuration, it's key
to initialize raft_address_map with IP addresses before we start group
0. Best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
2022-10-10 15:58:48 +03:00
Kamil Braun
4974a31510 test/topology_raft_disabled: more Raft upgrade tests
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.

Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
2022-10-10 14:32:10 +02:00
Pavel Emelyanov
caed12c8f2 system_keyspace: Add .shutdown() method
Many services out there have one (sometimes called .drain()) that's
called early on stop and that's responsible for prearing the service for
stop -- aborting pending/in-flight fibers and alike.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 15:29:33 +03:00
Kamil Braun
4460b4e63c test/topology_raft_disabled: refactor test_raft_upgrade
Take reusable parts out of the test to helper functions.
2022-10-10 12:59:12 +02:00
Kamil Braun
fa8dcb0d54 test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
The `removenode` operation normally requires the removing node to
contact every node in the cluster except the one that is being removed.
But if more than 1 node is down it's possible to specify a list of nodes
to ignore for the operation; the `/storage_service/remove_node` endpoint
accepts an `ignore_nodes` param which is a comma-separated list of IPs.

Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient`
so it's possible to pass the list of ignored nodes.

We also modify the `/cluster/remove-node` Manager endpoint to use
`put_json` instead of `get_text` and pass all parameters except the
initiator IP (the IP of the node who coordinates the `removenode`
operation) through JSON. This simplifies the URL greatly (it was already
messy with 3 parameters) and more closely resembles Scylla's endpoint.
2022-10-10 12:59:12 +02:00
Kamil Braun
130ab1d312 test/pylib: rest_client: propagate errors from put_json 2022-10-10 12:59:12 +02:00
Kamil Braun
63892326d5 test/pylib: fix some type hints 2022-10-10 12:59:12 +02:00
Kamil Braun
6e3fe13fcf test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
Do a simple `SELECT` instead.

This speeds up tests - creating and dropping keyspaces is relatively
expensive, and we did this on every server restart.
2022-10-10 12:59:12 +02:00
Alejo Sanchez
7e2a3f2040 test.py: improve pylint score for conftest
Remove unused imports, fix long lines, add ignore flags.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-10 12:07:41 +02:00
Alejo Sanchez
aa1f4a321c test.py: fix variable name collision with ssl
Change variable name to avoid collision with module ssl.

This bug was reintroduced when moving code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-10 11:59:13 +02:00
Pavel Emelyanov
53bad617c0 virtual_tables: Use token_metadata.is_member()
This method just jumps into topology.has_endpoint(). The change is
for consistency with other users of it and as a preparation for
topology.has_endpoint() future enhancements

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 12:16:19 +03:00
Tomasz Grabiec
fcf0628bc5 dbuild: Use .gdbinit from the host
Useful when starting gdb inside the dbuild container.
Message-Id: <20221007154230.1936584-1-tgrabiec@scylladb.com>
2022-10-09 11:14:33 +03:00
Petr Gusev
0923cb435f raft: mark removed servers as expiring instead of dropping them
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous, it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from address_map.

In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.

Fixes: #11228

Closes #11748
2022-10-07 19:08:34 +02:00
Avi Kivity
55606a51cb dirty_memory_manager: move region_group queued allocation releasing into a function
It's nicer to see a function release_queued_allocations() in a stack
trace rather than start_releaser(), which has done its work during
initialization.
2022-10-07 17:27:43 +03:00
Avi Kivity
3e60d6c243 dirty_memory_manager: fold allocation_queue into region_group
allocation_queue was extracted out of region_group in
71493c253 and 34d532236. But now that region_group refactoring is
mostly done, we can move them back in. allocation_queue has just one
user and is not useful standalone.
2022-10-07 17:27:40 +03:00
Avi Kivity
01368830b5 dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()
In 34d5322368 ("dirty_memory_manager: move more allocation_queue
functions out of region_group") we accidentally started ignoring the
timeout parameter. Fix that.

No release branch has the breakage.
2022-10-07 17:19:56 +03:00
Kamil Braun
06b87869ba Merge 'Raft transport error' from Gusev Petr
The `add_entry` and `modify_config` methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a `seastar::rpc::closed_error` would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling `modify_config` in `raft_group0`,
which sometimes broke the `removenode` command.

An `intermittent_connection_error` exception was added earlier to
solve a similar problem with the `read_barrier` method. In this patch it
is renamed to `transport_error`, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the destination and whether any mutations were made.

In case of `read_barrier` it does not matter and we just retry, in case
of `add_entry` and `modify_config` we cannot retry because of possible mutations,
so we convert this exception to `commit_status_unknown`, which
the client has to handle.

Explicit comments have also been added to `raft::server` methods
describing all possible exceptions.

Closes #11691

* github.com:scylladb/scylladb:
  raft_group0: retry modify_config on commit_status_unknown
  raft: convert raft::transport_error to raft::commit_status_unknown
2022-10-07 15:53:22 +02:00
Petr Gusev
12bb8b7c8d raft_group0: retry modify_config on commit_status_unknown
modify_config can throw commit_status_unknown in case
of a leader change or when the leader is unavailable,
but the information about it has not yet reached the
current node. In this patch modify_config is run
again after some time in this case.
2022-10-07 13:34:23 +04:00
Petr Gusev
d79fbab682 raft: convert raft::transport_error to raft::commit_status_unknown
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.

An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.

In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.

Explicit comments have also been added to raft::server methods
describing all possible exceptions.
2022-10-07 13:34:16 +04:00
Botond Dénes
b247f29881 Merge 'De-static system_keyspace::get_{saved|local}_tokens()' from Pavel Emelyanov
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env

Closes #11738

* github.com:scylladb/scylladb:
  system_keyspace: Make get_{local|saved}_tokens non static
  size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
  cql_test_env: Keep sharded<system_keyspace> reference
  size_estimate_virtual_reader: Keep system_keyspace reference
  system_keyspace: Pass sys_ks argument to install_virtual_readers()
  system_keyspace: Make make() non-static
  distributed_loader: Pass sys_ks argument to init_system_keyspace()
  system_keyspace: Remove dangling forward declaration
2022-10-07 11:28:32 +03:00
Botond Dénes
992afc5b8c Merge 'storage_proxy: coroutinize some functions with do_with' from Avi Kivity
do_with() is a sure indicator for coroutinization, since it adds
an allocation (like the coroutine does with its frame). Therefore
translating a function with do_with is at least a break-even, and
usually a win since other continuations no longer allocate.

This series converts most of storage_proxy's function that have
do_with to coroutines. Two remain, since they are not simple
to convert (the do_with() is kept running in the background and
its future is discarded).

Individual patches favor minimal changes over final readability,
and there is a final patch that restores indentation.

The patches leave some moves from coroutine reference parameters
to the coroutine frame, this will be cleaned up in a follow-up. I wanted
this series not to touch headers to reduce rebuild times.

Closes #11683

* github.com:scylladb/scylladb:
  storage_proxy: reindent after coroutinization
  storage_proxy: convert handle_read_digest() to a coroutine
  storage_proxy: convert handle_read_mutation_data() to a coroutine
  storage_proxy: convert handle_read_data() to a coroutine
  storage_proxy: convert handle_write() to a coroutine
  storage_proxy: convert handle_counter_mutation() to a coroutine
  storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
2022-10-07 07:37:37 +03:00
Nadav Har'El
72dbce8d46 docs, alternator: mention S3 Import feature in compatibility.md
In August 2022, DynamoDB added a "S3 Import" feature, which we don't yet
support - so let's document this missing feature in the compatibility
document.

Refs #11739.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11740
2022-10-06 19:50:16 +03:00
Avi Kivity
20bad62562 Merge 'Detect and record large collections' from Benny Halevy
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.

A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table.  Documentation has been updated respectively.

A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't.  Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.

Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.

Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.

In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.

Closes #11449

Closes #11674

* github.com:scylladb/scylladb:
  sstables: mx/writer: optimize large data stats members order
  sstables: mx/writer: keep large data stats entry as members
  db: large_data_handler: dynamically update config thresholds
  utils/updateable_value: add transforming_value_updater
  db/large_data_handler: cql_table_large_data_handler: record large_collections
  db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
  db/large_data_handler: cql_table_large_data_handler: move ctor out of line
  docs: large-rows-large-cells-tables: fix typos
  db/system_keyspace: add collection_elements column to system.large_cells
  gms/feature_service: add large_collection_detection cluster feature
  test: sstable_3_x_test: add test_sstable_too_many_collection_elements
  test: lib: simple_schema: add support for optional collection column
  test: lib: simple_schema: build schema in ctor body
  test: lib: simple_schema: cql: define s1 as static only if built this way
  db/large_data_handler: maybe_record_large_cells: consider collection_elements
  db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
  sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
  sstables: mx/writer: add large_data_type::elements_in_collection
  db/large_data_handler: get the collection_elements_count_threshold
  db/config: add compaction_collection_elements_count_warning_threshold
  test: sstable_3_x_test: add test_sstable_write_large_cell
  test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
  test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
  test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
  test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
2022-10-06 18:28:21 +03:00
Avi Kivity
62a4d2d92b Merge 'Preliminary changes for multiple Compaction Groups' from Raphael "Raph" Carvalho
What's contained in this series:
- Refactored compaction tests (and utilities) for integration with multiple groups
    - The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group.
- Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue)
- Many changes in replica::table for allowing integration with multiple groups

Next:
- Introduce for_each_compaction_group() for iterating over groups wherever needed.
- Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc).
- Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups
- Introduce static option for defining number of compaction groups and implement function to map a token to its respective group.
- Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting).

Closes #11592

* github.com:scylladb/scylladb:
  sstable_resharding_test: Switch to table_for_tests
  replica: Move compacted_undeleted_sstables into compaction group
  replica: Use correct compaction_group in try_flush_memtable_to_sstable()
  replica: Make move_sstables_from_staging() robust and compaction group friendly
  test: Rename column_family_for_tests to table_for_tests
  sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
  test: Don't expose compound set in column_family_for_tests
  test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
  sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
  sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
  sstable_compaction_test: Switch to table_state in compact_sstables()
  sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
2022-10-06 18:23:47 +03:00
Kamil Braun
f94d547719 test.py: include modes in log file name
Instead of `test.py.log`, use:
`test.py.dev.log`
when running with `--mode dev`,
`test.py.dev-release.log`
when running with `--mode dev --mode release`,
and so on.

This is useful in Jenkins which is running test.py multiple times in
different modes; a later run would overwrite a previous run's test.py
file. With this change we can preserve the test.py files of all of these
runs.

Closes #11678
2022-10-06 18:20:39 +03:00
Kamil Braun
3af68052c4 test/topology: disable flaky test_remove_node_add_column test
The test was added recently and since then causes CI failures.

We suspect that it happens if the node being removed was the Raft group
0 leader. The removenode coordinator tries to send to it the
`remove_from_group0` request and fails.
A potential fix is in review: #11691.
2022-10-06 17:04:42 +02:00
Pavel Emelyanov
59da903054 system_keyspace: Make get_{local|saved}_tokens non static
Now all callers have system_keyspace reference at hand. This removes one
more user of the global qctx object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 18:02:09 +03:00
Pavel Emelyanov
b03f1e7b17 size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
This method static calls system_keyspace::get_local_tokens(). Having the
system_keyspace reference will make this method non-static

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 18:00:09 +03:00
Pavel Emelyanov
4c099bb3ed cql_test_env: Keep sharded<system_keyspace> reference
There's a test_get_local_ranges() call in size-estimate reader which
will need system keyspace reference. There's no other place for tests to
get it from but the cql_test_env thing

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:59:21 +03:00
Pavel Emelyanov
34e8e5959f size_estimate_virtual_reader: Keep system_keyspace reference
The s._e._v._reader::fill_buffer() method needs system keyspace to get
node's local tokens. Now it's a static method, having system_keyspace
reference will make it non-static

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:58:07 +03:00
Pavel Emelyanov
04552f2d58 system_keyspace: Pass sys_ks argument to install_virtual_readers()
The size-estimate-virtual-reader will need it, now it's available as
"this" from system_keyspace::make() method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:57:13 +03:00
Pavel Emelyanov
1938412d7a system_keyspace: Make make() non-static
This helper needs system_keyspace reference and using "this" as this
looks natural. Also this de-static-ification makes it possible to put
some sense into the invoke_on_all() call from init_system_keyspace()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:56:11 +03:00
Pavel Emelyanov
9f79525f8e distributed_loader: Pass sys_ks argument to init_system_keyspace()
It's final destination is virtual tabls registration code called from
init_system_keyspace() eventually

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:55:03 +03:00
Pavel Emelyanov
e996503f0d system_keyspace: Remove dangling forward declaration
It doesn't match the real system_keyspace_make() definition and is in
fact not needed, as there's another "real" one in database.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:54:22 +03:00
Vlad Zolotarov
8195dab92a scylla_prepare: correctly handle a former 'MQ' mode
Fixes a regression introduced in 80917a1054:
"scylla_prepare: stop generating 'mode' value in perftune.yaml"

When cpuset.conf contains a "full" CPU set the negation of it from
the "full" CPU set is going to generate a zero mask as a irq_cpu_mask.
This is an illegal value that will eventually end up in the generated
perftune.yaml, which in line will make the scylla service fail to start
until the issue is resolved.

In such a case a irq_cpu_mask must represent a "full" CPU set mimicking
a former 'MQ' mode.

Fixes #11701
Tested:
 - Manually on a 2 vCPU VM in an 'auto-selection' mode.
 - Manually on a large VM (48 vCPUs) with an 'MQ' manually
   enforced.
Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>
2022-10-06 17:43:37 +03:00
Avi Kivity
9932c4bd62 Merge 'cql3: Make CONTAINS NULL and CONTAINS KEY NULL return false' from Jan Ciołek
Currently doing `CONTAINS NULL` or `CONTAINS KEY NULL` on a collection evaluates to `true`.

This is a really weird behaviour. Collections can't contain `NULL`, even if they wanted to.
Any operation that has a NULL on either side should evaluate to `NULL`, which is interpreted as `false`.
In Cassandra trying to do `CONTAINS NULL` causes an error.

Fixes: #10359

The only problem is that this change is not backwards compatible. Some existing code might break.

Closes #11730

* github.com:scylladb/scylladb:
  cql3: Make CONTAINS KEY NULL return false
  cql3: Make CONTAINS NULL return false
2022-10-06 17:08:56 +03:00
Petr Gusev
40bd9137f8 removenode: add warning in case of exception
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.

Fixes: #11722
Closes #11735
2022-10-06 13:49:26 +02:00
Benny Halevy
480b4759a9 idl: streaming: include stream_fwd.hh
To keep the idl definition of plan_id from
getting out of sync with the one in stream_fwd.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11720
2022-10-06 13:49:26 +02:00
Kamil Braun
962ee9ba7b Merge 'Make raft_group0 -> system_keyspace dependency explicit' from Pavel Emelyanov
The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace().

This makes raft use system ks directly, snitch is patched by another branch

Closes #11729

* github.com:scylladb/scylladb:
  raft_group0: Use local reference
  raft_group0: Add system keyspace reference
2022-10-06 13:49:26 +02:00
Tomasz Grabiec
023f78d6ae test: lib: random_mutation_generator: Introduce a switch for generating simpler mutations for easier debugging
Closes #11731
2022-10-06 13:49:26 +02:00
Raphael S. Carvalho
14d6459efc compaction: Make compaction_manager stop more robust
Commit aba475fe1d accidentally fixed a race, which happens in
the following sequence of events:

1) storage service starts drain() via API for example
2) main's abort source is triggered, calling compaction_manager's do_stop()
via subscription.
        2.1) do_stop() initiates the stop but doesn't wait for it.
        2.2) compaction_manager's state is set to stopped, such that
        compaction_manager::stop() called in defer_verbose_shutdown()
        will wait for the stop and not start a new one.
3) drain() calls compaction_manager::drain() changing the state from
stopped to disabled.
4) main calls compaction_manager::stop() (as described in 2.2) and
incorrectly tries to stop the manager again, because the state was
changed in step 3.

aba475fe1d accidentally fixed this problem because drain() will no
longer take place if it detects the shutdown process was initiated
(it does so by ignoring drain request if abort source's subscription
was unlinked).

This shows us that looking at the state to determine if stop should be
performed is fragile, because once the state changes from A to B,
manager doesn't know the state was A. To make it robust, we can instead
check if the future that stores stop's promise is engaged, meaning that
the stop was already initiated and we don't have to start a new one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11711
2022-10-06 13:49:26 +02:00
Botond Dénes
753f671eaa Merge 'dirty_memory_manager: simplify, clarify, and document' from Avi Kivity
This series undoes some recent damage to clarity, then
goes further by renaming terms around dirty_memory_manager
to be clearer. Documentation is added.

Closes #11705

* github.com:scylladb/scylladb:
  dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
  dirty_memory_manager: rename _virtual_region_group
  api: column_family: fix memtable off-heap memory reporting
  dirty_memory_manager: unscramble terminology
2022-10-06 13:49:26 +02:00
Tomasz Grabiec
4c8dc41f75 Merge 'Handle storage_io_error's ENOSPC when flushing' from Pavel Emelyanov
This is the continuation of the a980510654 that tries to catch ENOSPCs reported via storage_io_error similarly to how defer_verbose_shutdown() does on stop

Closes #11664

* github.com:scylladb/scylladb:
  table: Handle storage_io_error's ENOSPC when flushing
  table: Rewrap retry loop
2022-10-06 13:49:26 +02:00
Raphael S. Carvalho
fcdff50a35 sstable_resharding_test: Switch to table_for_tests
Important step for multiple compaction groups. As a bonus, lots
of boilerplate is removed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
cf3f93304e replica: Move compacted_undeleted_sstables into compaction group
Compacted undeleted sstables are relevant for avoiding data resurrection
in the purge path. As token ranges of groups won't overlap, it's
better to isolate this data, so to prevent one group from interfering
with another.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
56ac62bbd6 replica: Use correct compaction_group in try_flush_memtable_to_sstable()
We need to pass the compaction_group received as a param, not the one
retrieved via as_table_state(). Needed for supporting multiple
groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
707ebf9cf7 replica: Make move_sstables_from_staging() robust and compaction group friendly
Off-strategy can happen in parallel to view building.

A semaphore is used to ensure they don't step on each other's
toe.

If off-strategy completes first, then move_sstables_from_staging()
won't find the SSTable alive and won't reach code to add
the file to the backlog tracker.

If view building completes first, the SSTable exists, but it's
not reshaped yet (has repair origin) and shouldn't be
added to the backlog tracker.

Off-strategy completion code will make sure new sstables added
to main set are accounted by the backlog tracker, so
move_sstables_from_staging() only need to add to tracker files
which are certainly not going through a reshape compaction.

So let's take these facts into account to make the procedure
more robust and compaction group friendly. Very welcome change
for when multiple groups are supported.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
7d82373e3a test: Rename column_family_for_tests to table_for_tests
To avoid confusion, as replica::column_family was already renamed
to replica::table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
e56bfecd8d sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
That's important for multiple compaction groups. Once replica::table
supports multiple groups, there will be no table::as_table_state(),
so for testing table with a single group, we'll be relying on
column_family_for_tests::as_table_state().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
5a028ca4dc test: Don't expose compound set in column_family_for_tests
The compound set shouldn't be exposed in main_sstables() because
once we complete the switch to column_family_for_tests::table_state,
can happen compaction will try to remove or add elements to its
set snapshot, and compound set isn't allowed to either ops.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
b16d6c55b1 test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
Needed once we switch to column_family_for_tests::table_state, so unit
tests relying on correct value will still work

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
a6d24a763a sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
This change will make table_state_for_test the table_state of
column_family_for_tests. Today, an unit test has to keep a reference
to them both and logically couple them, but that's error prone.

This change is also important when replica::table supports multiple
compaction groups, so unit tests won't have to directly reference
the table_state of table, but rather use the one managed by
column_family_for_tests.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
6a0eabd17a sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
a6affea008 sstable_compaction_test: Switch to table_state in compact_sstables()
The switch is important once we have multiple compaction groups,
as a single table may own several groups. There will no longer be
a replica::table::as_table_state().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
2aa6518486 sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
Lots of boilerplate is reduced, and will also help to complete the
switch from replica::table to compaction::table_state in the unit
tests.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:18 -03:00
Jan Ciolek
a2c359a741 cql3: Make CONTAINS KEY NULL return false
A binary operator like this:
{1: 2, 3: 4} CONTAINS KEY NULL
used to evaluate to `true`.

This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.

This change is not backwards compatible.
Some existing code might break.

partially fixes: #10359

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-05 18:15:44 +02:00
Jan Ciolek
bbfef4b510 cql3: Make CONTAINS NULL return false
A binary operator like this:
[1, 2, 3] CONTAINS NULL
used to evaluate to `true`.

This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.

This change is not backwards compatible.
Some existing code might break.

partially fixes: #10359

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-05 18:15:15 +02:00
Pavel Emelyanov
fb8ed684fa raft_group0: Use local reference
It now grabs one from gossiper which is weird. A bit later it will be
possible to remove gossiper->system_keyspace dependency

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-05 17:35:58 +03:00
Pavel Emelyanov
8570fe3c30 raft_group0: Add system keyspace reference
The sharded<system_keyspace> is already started by the time raft_group0
is created

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-05 17:35:13 +03:00
Michał Chojnowski
a0204c17c5 treewide: remove mentions of seastar::thread::should_yield()
thread_scheduling_group has been retired many years ago.
Remove the leftovers, they are confusing.

Closes #11714
2022-10-05 12:26:37 +03:00
Michał Chojnowski
8aa24194b7 row_cache: remove a dead try...catch block in eviction
All calls in the try block have been noexcept for some time.
Remove the try...catch and the associated misleading comment to avoid confusing
source code readers.

Closes #11715
2022-10-05 12:23:47 +03:00
Benny Halevy
7286f5d314 sstables: mx/writer: optimize large data stats members order
Since `_partition_size_entry` and `_rows_in_partition_entry`
are accessed at the same time when updated, and similarly
`_cell_size_entry` and `_elements_in_collection_entry`,
place the member pairs closely together to improve data
cache locality.

Follow the same order when preparing the
`scylla_metadata::large_data_stats` map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
8c8a0adb40 sstables: mx/writer: keep large data stats entry as members
To save the map lookup on the hot write path,
keep each large data stats entry as a member in the writer
object and build a map for storing the disk_hash in the
scylla metadata only when finalizing it in consume_end_of_stream.

Fixes #11686

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
2c4ff71d2b db: large_data_handler: dynamically update config thresholds
make the various large data thresholds live-updateable
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.

Fixes #11685

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:53:40 +03:00
Benny Halevy
6d582054c0 utils/updateable_value: add transforming_value_updater
Automatically updates a value from a utils::updateable_value
Where they can be of different types.
An optional transfom function can provide an additional transformation
when updating the value, like multiplying it by a factor for unit conversion,
for example.

To be used for auto-updating the large data thresholds
from the db::config.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:52:49 +03:00
Botond Dénes
4c13328788 Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.

Fixes https://github.com/scylladb/scylladb/issues/11681.

Closes #11682

* github.com:scylladb/scylladb:
  replica: Return all sstables in table::get_sstable_set()
  sstables: Fix cloning of compound_sstable_set
2022-10-05 06:55:50 +03:00
Pavel Emelyanov
2c1ef0d2b7 sstables.hh: Remove unused headers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11709
2022-10-04 23:37:07 +02:00
Raphael S. Carvalho
827750c142 replica: Return all sstables in table::get_sstable_set()
get_sstable_set() as its name implies is not confined to the main
or maintenance set, nor to a specific compaction group, so let's
make it return the compound set which spans all groups, meaning
all sstables tracked by a table will be returned.

This is a regression introduced in 1e7a444. It affects the API
to return sstable list containing a partition key, as sstables
in maintenance would be missed, fooling users of the API like
tools that could trust the output.

Each compaction group is returning the main and maintenance set
in table_state's main_sstable_set() and maintenance_sstable_set(),
respectively.

Fixes #11681.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-04 10:43:27 -03:00
Raphael S. Carvalho
eddf32b94c sstables: Fix cloning of compound_sstable_set
The intention was that its clone() would actually clone the content
of an existing set into a new one, but the current impl is actually
moving the sets instead of copying them. So the original set
becomes invalid. Luckily, this problem isn't triggered as we're
not exposing the compound set in the table's interface, so the
compound_sstable_set::clone() method isn't being called.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-04 10:43:25 -03:00
Felipe Mendes
f67bb43a7a locator: ec2_snitch: IMDSv2 support
Access to AWS Metadata may be configured in three distinct ways:
   1 - Optional HTTP tokens and HTTP endpoint enabled: The default as it works today
   2 - Required HTTP tokens and HTTP endpoint enabled: Which support is entirely missing today
   3 - HTTP endpoint disabled: Which effectively forbids one to use Ec2Snitch or Ec2MultiRegionSnitch

This commit makes the 2nd option the default which is not only AWS recommended option, but is also entirely compatible with the 1st option.
In addition, we now validate the HTTP response when querying the IMDS server. Therefore - should a HTTP 403 be received - Scylla will
properly notify users on what they are trying to do incorrectly in their setup.

The commit was tested under the following circumstances (covering all 3 variants):
 - Ec2Snitch: IMDSv2 optional & required, and HTTP server disabled.
 - Ec2MultiRegionSnitch: IMDSv2 optional & required, and HTTP server disabled.

Refs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html
      https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html
      https://github.com/scylladb/scylladb/issues/9987
Fixes: https://github.com/scylladb/scylladb/issues/10490
Closes: https://github.com/scylladb/scylladb/issues/10490

Closes #11636
2022-10-04 15:48:42 +03:00
Avi Kivity
37c6b46d26 dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.

In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).

I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".

I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.

The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
2022-10-04 14:03:59 +03:00
Avi Kivity
d02c407769 dirty_memory_manager: rename _virtual_region_group
Since we folded _real_region_group into _virtual_region_group, the
"virtual" tag makes no sense any more, so remove it.
2022-10-04 14:01:45 +03:00
Avi Kivity
b0814bdd42 api: column_family: fix memtable off-heap memory reporting
We report virtual memory used, but that's not a real accounting
of the actual memory used. Use the correct real_memory_used() instead.

Note that this isn't a recent regression and was probably broken forever.
However nobody looks at this measure (and it's usually close to the
correct value) so nobody noticed.

Since it's so minor, I didn't bother filing an issue.
2022-10-04 13:56:29 +03:00
Avi Kivity
bc2fcf5187 dirty_memory_manager: unscramble terminology
Before 95f31f37c1 ("Merge 'dirty_memory_manager: simplify
region_group' from Avi Kivity"), we had two region_group
objects, one _real_region_group and another _virtual_region_group,
each with a set of "soft" and "hard" limits and related functions
and members.

In 95f31f37c1, we merged _real_region_group into _virtual_region_group,
but unfortunately the _real_region_group members received the "hard"
prefix when they got merged. This overloads the meaning of "hard" -
is it related to soft/hard limit or is it related to the real/virtual
distinction?

This patch applied some renaming to restore consistency. Anything
that came from _virtual_region_group now has "virtual" in its name.
Anything that came from _real_region_group now has "real" in its name.
The terms are still pretty bad but at least they are consistent.
2022-10-04 13:56:28 +03:00
Kamil Braun
c200ae2228 Merge 'test.py topology Scylla REST API client' from Alecco
- Separate `aiohttp` client code
- Helper to access Scylla server REST API
- Use helper both in `ScyllaClusterManager` (test.py process) and `ManagerClient` (pytest process)
- Add `removenode` and `decommission` operations.

Closes #11653

* github.com:scylladb/scylladb:
  test.py: Scylla REST methods for topology tests
  test.py: rename server_id to server_ip
  test.py: HTTP client helper
  test.py: topology pass ManagerClient instead of...
  test.py: delete unimplemented remove server
  test.py: fix variable name ssl name clash
2022-10-04 11:50:18 +02:00
Botond Dénes
169a8a66f2 compatible_ring_position_or_view: make it cheap to copy
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations

For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch make said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).

Fixes: #11669

Closes #11670
2022-10-04 12:00:21 +03:00
Piotr Dulikowski
51f813d89b storage_proxy: update rate limited reads metric when coordinator rejects
The decision to reject a read operation can either be made by replicas,
or by the coordinator. In the second case, the

  scylla_storage_proxy_coordinator_read_rate_limited

metric was not incremented, but it should. This commit fixes the issue.

Fixes: #11651

Closes #11694
2022-10-04 10:33:58 +03:00
Pavel Emelyanov
9cd1f777a5 database.hh: Remove unused headers
Use forward declarations when needed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11667
2022-10-04 09:01:38 +03:00
Botond Dénes
5fd4b1274e Merge 'compaction_manager: Don't let ENOSPC throw out of ::stop() method' from Pavel Emelyanov
The seastar defer_stop() helper is cool, but it forwards any exception from the .stop() towards the caller. In case the caller is main() the exception causes Scylla to abort(). This fires, for example, in compaction_manager::stop() when it steps on ENOSPC

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11662

* github.com:scylladb/scylladb:
  compaction_manager: Swallow ENOSPCs in ::stop()
  exceptions: Mark storage_io_error::code() with noexcept
2022-10-04 08:54:22 +03:00
Nadav Har'El
3a30fbd56c test/alternator: fix timeout in flaky test test_ttl_stats
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.

So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.

Fixes #11695

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11696
2022-10-04 08:50:51 +03:00
Benny Halevy
46ebffcc93 db/large_data_handler: cql_table_large_data_handler: record large_collections
When the large_collection_detection cluster feature is enabled,
select the internal_record_large_cells_and_collections method
to record the large collection cell, storing also the collection_elements
column.

We want to do that only when the cluster feature is enabled
to facilitate rollback in case rolling upgrade is aborted,
otherwise system.large_cells won't be backward compatible
and will have to be deleted manually.

Delete the sstable from system.large_cells if it contains
elements_in_collection above threshold.

Closes #11449

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:10 +03:00
Benny Halevy
3f8bba202f db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
For recording collection_elements of large_collections when
the large_collection_detection feature is enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:10 +03:00
Benny Halevy
dc4e7d8e01 db/large_data_handler: cql_table_large_data_handler: move ctor out of line
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:09 +03:00
Benny Halevy
f4c3070002 docs: large-rows-large-cells-tables: fix typos
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:09 +03:00
Benny Halevy
2f49eebb04 db/system_keyspace: add collection_elements column to system.large_cells
And bump the schema version offset since the new schema
should be distinguishable from the previous one.

Refs scylladb/scylladb#11660

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:08 +03:00
Benny Halevy
9ad41c700e gms/feature_service: add large_collection_detection cluster feature
And a corresponding db::schema_feature::SCYLLA_LARGE_COLLECTIONS

We want to enable the schema change supporting collection_elements
only when all nodes are upgraded so that we can roll back
if the rolling upgrade process is aborted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:07 +03:00
Benny Halevy
9eeb8f2971 test: sstable_3_x_test: add test_sstable_too_many_collection_elements
Test that collections with too many elements are detected properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:07 +03:00
Benny Halevy
3c11937b00 test: lib: simple_schema: add support for optional collection column
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:06 +03:00
Benny Halevy
7b5f2d2e53 test: lib: simple_schema: build schema in ctor body
Rather when initializing _s.

Prepare for adding an optional collection column.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:06 +03:00
Benny Halevy
db01641a44 test: lib: simple_schema: cql: define s1 as static only if built this way
Keep the with_static ctor parameter as private member
to be used by the cql() method to define s1 either as static or not.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:05 +03:00
Benny Halevy
6dadca2648 db/large_data_handler: maybe_record_large_cells: consider collection_elements
Detect large_collections when the number of collection_elements
is above the configured threshold.

Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:05 +03:00
Benny Halevy
27ee75c54e db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
Log in debug level when deleting large data entry
from system table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:04 +03:00
Benny Halevy
7dead10742 sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
And update the sstable elements_in_collection
stats entry.

Next step would be to forward it to
large_data_handler().maybe_record_large_cells().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:58 +03:00
Benny Halevy
54ab038825 sstables: mx/writer: add large_data_type::elements_in_collection
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:56 +03:00
Benny Halevy
a107f583fd db/large_data_handler: get the collection_elements_count_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:11 +03:00
Benny Halevy
167ec84eeb db/config: add compaction_collection_elements_count_warning_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:10 +03:00
Benny Halevy
5e88e6267e test: sstable_3_x_test: add test_sstable_write_large_cell
based on cell size threshold.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:09 +03:00
Benny Halevy
3980415d97 test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:09 +03:00
Benny Halevy
3eb4cda8ea test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:08 +03:00
Benny Halevy
0a9d3f24e6 test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
This way, the boost infrastructure prints the offending values
if the test assertion fails.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:07 +03:00
Benny Halevy
9668dd0e2d test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
So it would be reproducible based on the test random-seed

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:30:51 +03:00
Kamil Braun
114419d6ab service/raft: raft_group0_client: read on-disk an in-memory group0 upgrade atomically
`set_group0_upgrade_state` writes the on-disk state first, then
in-memory state second, both under a write lock.
`get_group0_upgrade_state` would only take the lock if the in-memory
state was `use_pre_raft_procedures`.

If there's an external observer who watches the on-disk state to decide
whether Raft upgrade finished yet, the following could happen:
1. The node wrote `use_post_raft_procedures` to disk but didn't update
   the in-memory state yet, which is still `synchronize`.
2. The external client reads the table and sees that the state is
   `use_post_raft_procedures`, and deduces that upgrade has finished.
3. The external client immediately tries to perform a schema change. The
   schema change code calls `get_group0_upgrade_state` which does not
   take the read lock and returns `synchronize`. The schema change gets
   denied because schema changes are not allowed in `synchronize`.

Make sure that `get_group0_upgrade_state` cannot execute in-between
writing to disk and updating the in-memory state by always taking the
read lock before reading the in-memory state. As it was before, it will
immediately drop the lock if the state is not `use_pre_raft_procedures`.

This is useful for upgrade tests, which read the on-disk state to decide
whether upgrade has finished and often try to perform a schema change
immediately afterwards.

Closes #11672
2022-10-03 19:04:16 +02:00
Alejo Sanchez
abf1425ad4 test.py: Scylla REST methods for topology tests
Provide a helper client for Scylla REST requests. Use it on both
ScyllaClusterManager (e.g. remove node, test.py process) and
ManagerClient (e.g. get uuid, pytest process).

For now keep using IPs as key in ScyllaCluster, but this will be changed
to UUID -> IP in the future. So, for now, pass both independently. Note
the UUID must be obtained from the server before stopping it.

Refresh client driver connection when decommissioning or removing
a node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
86c752c2a0 test.py: rename server_id to server_ip
In ScyllaCluster currently servers are tracked by the host IP. This is
not the host id (UUID). Fix the variable name accordingly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
a7a0b446f0 test.py: HTTP client helper
Split aiohttp client to a shared helper file.

While there, move aiohttp session setup back to constructors. When there
were teardown issues it looked it could be caused by aiohttp session
being created outside a coroutine. But this is proven not to be the case
after recent fixes. So move it back to the ManagerClient constructor.

On th other hand, create a close() coroutine to stop the aiohttp session.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
41dbdf0f70 test.py: topology pass ManagerClient instead of...
cql connection

When there are topology changes, the driver needs to be updated. Instead
of passing the CassandraCluster.Connection, pass the ManagerClient
instance which manages the driver connection inside of it.

Remove workaround for test_raft_upgrade.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:00:47 +02:00
Alejo Sanchez
0c3a06d0d7 test.py: delete unimplemented remove server
Delete of Unused and unimplemented broken version of remove server.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 18:57:38 +02:00
Alejo Sanchez
98bc4c198f test.py: fix variable name ssl name clash
Change variable ssl to use_ssl to avoid clash with ssl module.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 18:57:38 +02:00
Avi Kivity
7626fd573a storage_proxy: reindent after coroutinization 2022-10-03 19:33:39 +03:00
Avi Kivity
019b18b232 storage_proxy: convert handle_read_digest() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
aa5f4bf1f3 storage_proxy: convert handle_read_mutation_data() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
bcd134e9b8 storage_proxy: convert handle_read_data() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
167c8b1b5e storage_proxy: convert handle_write() to a coroutine
A do_with() makes this at least a break-even.

Some internal lambdas were not converted since they commonly
do not allocate or block.

A finally() continuation is converted to seastar::defer().
2022-10-03 19:33:39 +03:00
Avi Kivity
741d6609a5 storage_proxy: convert handle_counter_mutation() to a coroutine
The do_with means the coroutine conversion is free, and conversion
of parallel_for_each to coroutine::parallel_for_each saves a
possible allocation (though it would not have been allocated usually.

An inner continuation is not converted since it usually doesn't
block, and therefore doesn't allocate.
2022-10-03 19:33:39 +03:00
Avi Kivity
ac5fae4b93 storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
It's simpler, and the do_with() allocation + task cancels out the
coroutine allocation + task.
2022-10-03 19:33:29 +03:00
Pavel Emelyanov
d22b130af1 compaction_manager: Swallow ENOSPCs in ::stop()
When being stopped compaction manager may step on ENOSPC. This is not a
reason to fail stopping process with abort, better to warn this fact in
logs and proceed as if nothing happened

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-03 18:54:48 +03:00
Pavel Emelyanov
7ba1f551f3 exceptions: Mark storage_io_error::code() with noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-03 18:50:06 +03:00
Kamil Braun
67ee6500e3 service/raft: raft_group_registry: pass direct_fd_pinger by reference
It was passed to `raft_group_registry::direct_fd_proxy` by value. That
is a bug, we want to pass a reference to the instance that is living
inside `gossiper`.

Fortunately this bug didn't cause problems, because the pinger is only
used for one function, `get_address`, which looks up an address in a map
and if it doesn't find it, accesses the map that lives inside
`gossiper` on shard 0 (and then caches it in the local copy).

Explicitly delete the copy constructor of `direct_fd_pinger` so this
doesn't happen again.

Closes #11661
2022-10-03 16:40:35 +02:00
Tomasz Grabiec
9dae2b9c02 Merge 'mutation_fragment_stream_validator: various API improvements' from Botond Dénes
The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had.
Active tombstone validation is pushed down to the low level validator.
The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore.

Closes #11614

* github.com:scylladb/scylladb:
  mutation_fragment_stream_validator: make interface more robust
  mutation_fragment_stream_validator: add reset() to validating filter
  mutation_fragment_stream_validator: move active tomsbtone validation into low level validator
2022-10-03 16:23:46 +02:00
Botond Dénes
95f31f37c1 Merge 'dirty_memory_manager: simplify region_group' from Avi Kivity
region_group evolved as a tree, each node of which contains some
regions (memtables). Each node has some constraints on memory, and
can start flushing and/or stop allocation into its memtables and those
below it when those constraints are violated.

Today, the tree has exactly two nodes, only one of which can hold memtables.
However, all the complexity of the tree remains.

This series applies some mechanical code transformations that remove
the tree structure and all the excess functionality, leaving a much simpler
structure behind.

Before:
 - a tree of region_group objects
 - each with two parameters: soft limit and hard limit
 - but only two instances ever instantiated
After:
 - a single region_group object
 - with three parameters - two from the bottom instance, one from the top instance

Closes #11570

* github.com:scylladb/scylladb:
  dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
  dirty_memory_manager: simplify region_group::update()
  dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
  dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
  dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
  dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
  dirty_memory_manager: merge memory_hard_limit into region_group
  dirty_memory_manager: rename members in memory_hard_limit
  dirty_memory_manager: fold do_update() into region_group::update()
  dirty_memory_manager: simplify memory_hard_limit's do_update
  dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
  dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
  dirty_memory_manager: adjust soft_limit threshold check
  dirty_memory_manager: drop memory_hard_limit::_name
  dirty_memory_manager: simplify memory_hard_limit configuration
  dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
  dirty_memory_manager: stop inheriting from region_group_reclaimer
  dirty_memory_manager: test: unwrap region_group_reclaimer
  dirty_memory_manager: change region_group_reclaimer configuration to a struct
  dirty_memory_manager: convert region_group_reclaimer to callbacks
  dirty_memory_manager: consolidate region_group_reclaimer constructors
  dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
  dirty_memory_manager: drop unused parameter to memory_hard_limit constructor
  dirty_memory_manager: drop memory_hard_limit::shutdown()
  dirty_memory_manager: split region_group hierarchy into separate classes
  dirty_memory_manager: extract code block from region_group::update
  dirty_memory_manager: move more allocation_queue functions out of region_group
  dirty_memory_manager: move some allocation queue related function definitions outside class scope
  dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
  dirty_memory_manager: remove support for multiple subgroups
2022-10-03 13:22:47 +03:00
Anna Stuchlik
3950a1cac8 doc: apply the feedback to improve clarity 2022-10-03 11:14:51 +02:00
Botond Dénes
5621cdd7f9 db/view/view_builder: don't drop partition and range tombstones when resuming
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of this to be
resurrected and the view builder to generate view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.

Fixes: #11668

Closes #11671
2022-10-03 11:28:22 +03:00
Avi Kivity
2c744628ae Update abseil submodule
* abseil 9e408e05...7f3c0d78 (193):
  > Allows absl::StrCat to accept types that implement AbslStringify()
  > Merge pull request #1283 from pateldeev:any_inovcable_rename_true
  > Cleanup: SmallMemmove nullify should also be limited to 15 bytes
  > Cleanup: implement PrependArray and PrependPrecise in terms of InlineData
  > Cleanup: Move BitwiseCompare() to InlineData, and make it layout independent.
  > Change kPower10Table bounds to be half-open
  > Cleanup some InlineData internal layout specific details from cord.h
  > Improve the comments on the implementation of format hooks adl tricks.
  > Expand LogEntry method docs.
  > Documentation: Remove an obsolete note about the implementation of `Cord`.
  > `absl::base_internal::ReadLongFromFile` should use `O_CLOEXEC` and handle interrupts to `read`
  > Allows absl::StrFormat to accept types which implement AbslStringify()
  > Add common_policy_traits - a subset of hash_policy_traits that can be shared between raw_hash_set and btree.
  > Split configuration related to cycle clock into separate headers
  > Fix -Wimplicit-int-conversion and -Wsign-conversion warnings in btree.
  > Implement Eisel-Lemire for from_chars<float>
  > Import of CCTZ from GitHub.
  > Adds support for "%v" in absl::StrFormat and related functions for bool values. Note that %v prints bool values as "true" and "false" rather than "1" and "0".
  > De-pointerize LogStreamer::stream_, and fix move ctor/assign preservation of flags and other stream properties.
  > Explicitly disallows modifiers for use with %v.
  > Change the macro ABSL_IS_TRIVIALLY_RELOCATABLE into a type trait - absl::is_trivially_relocatable - and move it from optimization.h to type_traits.h.
  > Add sparse and string copy constructor benchmarks for hash table.
  > Make BTrees work with custom allocators that recycle memory.
  > Update the readme, and (internally) fix some export processes to better keep it up-to-date going forward.
  > Add the fact that CHECK_OK exits the program to the comment of CHECK_OK.
  > Adds support for "%v" in absl::StrFormat and related functions for numeric types, including integer and floating point values. Users may now specify %v and have the format specifier deduced. Integer values will print according to %d specifications, unsigned values will use %u, and floating point values will use %g. Note that %v does not work for `char` due to ambiguity regarding the intended output. Please continue to use %c for `char`.
  > Implement correct move constructor and assignment for absl::strings_internal::OStringStream, and mark that class final.
  > Add more options for `BM_iteration` in order to see better picture for choosing trade off for iteration optimizations.
  > Change `EndComparison` benchmark to not measure iteration. Also added `BM_Iteration` separately.
  > Implement Eisel-Lemire for from_chars<double>
  > Add `-llog` to linker options when building log_sink_set in logging internals.
  > Apply clang-format to btree.h.
  > Improve failure message: tell the values we don't like.
  > Increase the number of per-ObjFile program headers we can expect.
  > Fix "unsafe narrowing" warnings in absl, 8/n.
  > Fix format string error with an explicit cast
  > Add a case to detect when the Bazel compiler string is explicitly set to "gcc", instead of just detecting Bazel's default "compiler" string.
  > Fix "unsafe narrowing" warnings in absl, 10/n.
  > Fix "unsafe narrowing" warnings in absl, 9/n.
  > Fix stacktrace header includes
  > Add a missing dependency on :raw_logging_internal
  > CMake: Require at least CMake 3.10
  > CMake: install artifacts reflect the compiled ABI
  > Fixes bug so that `%v` with modifiers doesn't compile. `%v` is not intended to work with modifiers because the meaning of modifiers is type-dependent and `%v` is intended to be used in situations where the type is not important. Please continue using if `%s` if you require format modifiers.
  > Convert algorithm and container benchmarks to cc_binary
  > Merge pull request #1269 from isuruf:patch-1
  > InlinedVector: Small improvement to the max_size() calculation
  > CMake: Mark hash_testing as a public testonly library, as it is with Bazel
  > Remove the ABSL_HAVE_INTRINSIC_INT128 test from pcg_engine.h
  > Fix ClangTidy warnings in btree.h and btree_test.cc.
  > Fix log StrippingTest on windows when TCHAR = WCHAR
  > Refactors checker.h and replaces recursive functions with iterative functions for readability purposes.
  > Refactors checker.h to use if statements instead of ternary operators for better readability.
  > Import of CCTZ from GitHub.
  > Workaround for ASAN stack safety analysis problem with FixedArray container annotations.
  > Rollback of fix "unsafe narrowing" warnings in absl, 8/n.
  > Fix "unsafe narrowing" warnings in absl, 8/n.
  > Changes mutex profiling
  > InlinedVector: Correct the computation of max_size()
  > Adds support for "%v" in absl::StrFormat and related functions for string-like types (support for other builtin types will follow in future changes). Rather than specifying %s for strings, users may specify %v and have the format specifier deduced. Notably, %v does not work for `const char*` because we cannot be certain if %s or %p was intended (nor can we be certain if the `const char*` was properly null-terminated). If you have a `const char*` you know is null-terminated and would like to work with %v, please wrap it in a `string_view` before using it.
  > Fixed header guards to match style guide conventions.
  > Typo fix
  > Added some more no_test.. tags to build targets for controlling testing.
  > Remove includes which are not used directly.
  > CMake: Add an option to build the libraries that are used for writing tests without requiring Abseil's tests be built (default=OFF)
  > Fix "unsafe narrowing" warnings in absl, 7/n.
  > Fix "unsafe narrowing" warnings in absl, 6/n.
  > Release the Abseil Logging library
  > Switch time_state to explicit default initialization instead of value initialization.
  > spinlock.h: Clean up includes
  > Fix minor typo in absl/time/time.h comment: "ToDoubleNanoSeconds" -> "ToDoubleNanoseconds"
  > Support compilers that are unknown to CMake
  > Import of CCTZ from GitHub.
  > Change bit_width(T) to return int rather than T.
  > Import of CCTZ from GitHub.
  > Merge pull request #1252 from jwest591:conan-fix
  > Don't try to enable use of ARM NEON intrinsics when compiling in CUDA device mode. They are not available in that configuration, even if the host supports them.
  > Fix "unsafe narrowing" warnings in absl, 5/n.
  > Fix "unsafe narrowing" warnings in absl, 4/n.
  > Import of CCTZ from GitHub.
  > Update Abseil platform support policy to point to the Foundational C++ Support Policy
  > Import of CCTZ from GitHub.
  > Add --features=external_include_paths to Bazel CI to ignore warnings from dependencies
  > Merge pull request #1250 from jonathan-conder-sm:gcc_72
  > Merge pull request #1249 from evanacox:master
  > Import of CCTZ from GitHub.
  > Merge pull request #1246 from wxilas21:master
  > remove unused includes and add missing std includes for absl/status/status.h
  > Sort INTERNAL_DLL_TARGETS for easier maintenance.
  > Disable ABSL_HAVE_STD_IS_TRIVIALLY_ASSIGNABLE for clang-cl.
  > Map the absl::is_trivially_* functions to their std impl
  > Add more SimpleAtod / SimpleAtof test coverage
  > debugging: handle alternate signal stacks better on RISCV
  > Revert change "Fix "unsafe narrowing" warnings in absl, 4/n.".
  > Fix "unsafe narrowing" warnings in absl, 3/n.
  > Fix "unsafe narrowing" warnings in absl, 4/n.
  > Fix "unsafe narrowing" warnings in absl, 2/n.
  > debugging: honour `STRICT_UNWINDING` in RISCV path
  > Fix "unsafe narrowing" warnings in absl, 1/n.
  > Add ABSL_IS_TRIVIALLY_RELOCATABLE and ABSL_ATTRIBUTE_TRIVIAL_ABI macros for use with clang's __is_trivially_relocatable and [[clang::trivial_abi]].
  > Merge pull request #1223 from ElijahPepe:fix/implement-snprintf-safely
  > Fix frame pointer alignment check.
  > Fixed sign-conversion warning in code.
  > Import of CCTZ from GitHub.
  > Add missing include for std::unique_ptr
  > Do not re-close files on EINTR
  > Renamespace absl::raw_logging_internal to absl::raw_log_internal to match (upcoming) non-raw logging namespace.
  > Check for negative return values from ReadFromOffset
  > Use HTTPS RFC URLs, which work regardless of the browser's locale.
  > Avoid signedness change when casting off_t
  > Internal Cleanup: removing unused internal function declaration.
  > Make Span complain if constructed with a parameter that won't outlive it, except if that parameter is also a span or appears to be a view type.
  > any_invocable_test: Re-enable the two conversion tests that used to fail under MSVC
  > Add GetCustomAppendBuffer method to absl::Cord
  > debugging: add hooks for checking stack ranges
  > Minor clang-tidy cleanups
  > Support [[gnu::abi_tag("xyz")]] demangling.
  > Fix -Warray-parameter warning
  > Merge pull request #1217 from anpol:macos-sigaltstack
  > Undo documentation change on erase.
  > Improve documentation on erase.
  > Merge pull request #1216 from brjsp:master
  > string_view: conditional constexpr is no longer needed for C++14
  > Make exponential_distribution_test a bigger test (timeout small -> moderate).
  > Move Abseil to C++14 minimum
  > Revert commit f4988f5bd4176345aad2a525e24d5fd11b3c97ea
  > Disable C++11 testing, enable C++14 and C++20 in some configurations where it wasn't enabled
  > debugging: account for differences in alternate signal stacks
  > Import of CCTZ from GitHub.
  > Run flaky test in fewer configurations
  > AnyInvocable: Move credits to the top of the file
  > Extend visibility of :examine_stack to an upcoming Abseil Log.
  > Merge contiguous mappings from the same file.
  > Update versions of WORKSPACE dependencies
  > Use ABSL_INTERNAL_HAS_SSE2 instead of __SSE2__
  > PR #1200: absl/debugging/CMakeLists.txt: link with libexecinfo if needed
  > Update GCC floor container to use Bazel 5.2.0
  > Update GoogleTest version used by Abseil
  > Release absl::AnyInvocable
  > PR #1197: absl/base/internal/direct_mmap.h: fix musl build on mips
  > absl/base/internal/invoke: Ignore bogus warnings on GCC >= 11
  > Revert GoogleTest version used by Abseil to commit 28e1da21d8d677bc98f12ccc7fc159ff19e8e817
  > Update GoogleTest version used by Abseil
  > explicit_seed_seq_test: work around/disable bogus warnings in GCC 12
  > any_test: expand the any emplace bug suppression, since it has gotten worse in GCC 12
  > absl::Time: work around bogus GCC 12 -Wrestrict warning
  > Make absl::StdSeedSeq an alias for std::seed_seq
  > absl::Optional: suppress bogus -Wmaybe-uninitialized GCC 12 warning
  > algorithm_test: suppress bogus -Wnonnull warning in GCC 12
  > flags/marshalling_test: work around bogus GCC 12 -Wmaybe-uninitialized warning
  > counting_allocator: suppress bogus -Wuse-after-free warning in GCC 12
  > Prefer to fallback to UTC when the embedded zoneinfo data does not contain the requested zone.
  > Minor wording fix in the comment for ConsumeSuffix()
  > Tweak the signature of status_internal::MakeCheckFailString as part of an upcoming change
  > Fix several typos in comments.
  > Reformulate documentation of ABSL_LOCKS_EXCLUDED.
  > absl/base/internal/invoke.h: Use ABSL_INTERNAL_CPLUSPLUS_LANG for language version guard
  > Fix C++17 constexpr storage deprecation warnings
  > Optimize SwissMap iteration by another 5-10% for ARM
  > Add documentation on optional flags to the flags library overview.
  > absl: correct the stack trace path on RISCV
  > Merge pull request #1194 from jwnimmer-tri:default-linkopts
  > Remove unintended defines from config.h
  > Ignore invalid TZ settings in tests
  > Add ABSL_HARDENING_ASSERTs to CordBuffer::SetLength() and CordBuffer::IncreaseLengthBy()
  > Fix comment typo about absl::Status<T*>
  > In b-tree, support unassignable value types.
  > Optimize SwissMap for ARM by 3-8% for all operations
  > Release absl::CordBuffer
  > InlinedVector: Limit the scope of the maybe-uninitialized warning suppression
  > Improve the compiler error by removing some noise from it. The "deleted" overload error is useless to users. By passing some dummy string to the base class constructor we use a valid constructor and remove the unintended use of the deleted default constructor.
  > Merge pull request #714 from kgotlinux:patch-2
  > Include proper #includes for POSIX thread identity implementation when using that implementation on MinGW.
  > Rework NonsecureURBGBase seed sequence.
  > Disable tests on some platforms where they currently fail.
  > Fixed typo in a comment.
  > Rollforward of commit ea78ded7a5f999f19a12b71f5a4988f6f819f64f.
  > Add an internal helper for logging (upcoming).
  > Merge pull request #1187 from trofi:fix-gcc-13-build
  > Merge pull request #1189 from renau:master
  > Allow for using b-tree with `value_type`s that can only be constructed by the allocator (ignoring copy/move constructors).
  > Stop using sleep timeouts for Linux futex-based SpinLock
  > Automated rollback of commit f2463433d6c073381df2d9ca8c3d8f53e5ae1362.
  > time.h: Use uint32_t literals for calls to overloaded MakeDuration
  > Fix typos.
  > Clarify the behaviour of `AssertHeld` and `AssertReaderHeld` when the calling thread doesn't hold the mutex.
  > Enable __thread on Asylo
  > Add implementation of is_invocable_r to absl::base_internal for C++ < 17, define it as alias of std::is_invocable_r when C++ >= 17
  > Optimize SwissMap iteration for aarch64 by 5-6%
  > Fix detection of ABSL_HAVE_ELF_MEM_IMAGE on Haiku
  > Don’t use generator expression to build .pc Libs lines
  > Update Bazel used on MacOS CI
  > Import of CCTZ from GitHub.

Closes #11687
2022-10-03 11:06:37 +03:00
Botond Dénes
f4540ef0d6 Merge 'Upgrade nix devenv' from Michael Livshin
To recap: the Nix devenv ({default,shell,flake}.nix and friends) in Scylla is a nicer (for those who consider it so, that is) alternative to dbuild: a completely deterministic build environment without Docker.

In theory we could support much more (creating installable packages, container images, various deployment affordances, etc. -- Nix is, among other things, a kind of parallel-to-everything-else devops realm) but there is clearly no demand and besides duplicating the work the release team is already doing (and doing just fine, needless to say) would be pointless and wasteful.

This PR reflects the accumulated changes that I have been carrying locally for the past year or so.  The version currently in master _probably_ can still build Scylla, but that Scylla certainly would not pass unit tests.

What the previous paragraph seems to mean is, apparently I'm the only active user of Nix devenv for Scylla.  Which, in turn, presents some obvious questions for the maintainers:

- Does this need to live in the Scylla source at all?  (The changes to non-Nix-specific parts are minimal and unobtrusive, but they are still changes)
- If it's left in, who is going to maintain it going forward, should more users somehow appear?  (I'm perfectly willing to fix things up when alerted, but no timeliness guarantees)

Closes #9557

* github.com:scylladb/scylladb:
  nix: add README.md
  build: improvements & upgrades to Nix dev environment
  build: allow setting SCYLLA_RELEASE from outside
2022-10-03 09:40:09 +03:00
Botond Dénes
2041744132 Merge 'readers/mutlishard: don't mix coroutines and continuations in the do_fill_buffer()' from Avi Kivity
The combination is hard to read and modify.

Closes #11665

* github.com:scylladb/scylladb:
  readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation
  readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine
2022-10-03 06:51:20 +03:00
Nadav Har'El
b8f8eb8710 Merge 'Improve test.py logging' from Kamil Braun
Include the unique test name (the unique name distinguishes between different test repeats) and the test case name where possible. Improve printing of clusters: include the cluster name and stopped servers. Fix some logging calls and add new ones.

Examples:
```
------ Starting test test_topology ------
```
became this:
```
------ Starting test test_topology.1::test_add_server_add_column ------
```

This:
```
INFO> Leasing Scylla cluster {127.191.142.1, 127.191.142.2, 127.191.142.3} for test test_add_server_add_column
```
became this:
```
INFO> Leasing Scylla cluster ScyllaCluster(name: 02cdd180-40d1-11ed-8803-3c2c30d32d96, running: {127.144.164.1, 127.144.164.2, 127.144.164.3}, stopped: {}) for test test_topology.1::test_add_server_add_column
```

Closes #11677

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: improve cluster printing
  test/pylib: don't pass test_case_name to after-test endpoint
  test/pylib: scylla_cluster: track current test case name and print it
  test.py: pass the unique test name (e.g. `test_topology.1`) to cluster manager
  test/pylib: scylla_cluster: pass the test case name to `before_test`
  test/pylib: use "test_case_name" variable name when talking about test cases
2022-10-02 20:48:50 +03:00
Pavel Emelyanov
2b8636a2a9 storage_proxy.hh: Remove unused headers
Add needed forward declarations and fix indirect inclusions in some .ccs

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11679
2022-10-02 20:48:50 +03:00
Michał Chojnowski
4563cbe595 logalloc: prevent false positives in reclaim_timer
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).

Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.

Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.

Fixes #10981

Closes #11680
2022-10-02 13:41:40 +03:00
Avi Kivity
372eadf542 Merge "perftune related improvements in scylla_* scripts" from Vlad Zolotarov
"
This series adds a long waited transition of our auto-generation
code to irq_cpu_mask instead of 'mode' in perftune.yaml.

And then it fixes a regression in scylla_prepare perftune.yaml
auto-generation logic.
"

* 'scylla_prepare_fix_regression-v1' of https://github.com/vladzcloudius/scylla:
  scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions
  scylla_prepare: stop generating 'mode' value in perftune.yaml
2022-10-02 13:25:13 +03:00
Michael Livshin
d178ac17dc nix: add README.md
Signed-off-by: Michael Livshin <repo@cmm.kakpryg.net>
2022-10-02 12:26:02 +03:00
Michael Livshin
7bd13be3f2 build: improvements & upgrades to Nix dev environment
* Add some more useful stuff to the shell environment, so it actually
  works for debugging & post-mortem analysis.

* Wrap ccache & distcc transparently (distcc will be used unless
  NODISTCC is set to a non-empty value in the environment; ccache will
  be used if CCACHE_DIR is not empty).

* Package the Scylla Python driver (instead of the C* one).

* Catch up to misc build/test requirements (including optional) by
  requiring or custom-packaging: wasmtime 0.29.0, cxxbridge,
  pytest-asyncio, liburing.

* Build statically-linked zstd in a saner and more idiomatic fashion.

* In pure builds (where sources lack Git metadata), derive
  SCYLLA_RELEASE from source hash.

* Refactor things for more parameterization.

* Explicitly stub out installPhase (seeing that "nix build" succeeds
  up to installPhase means we didn't miss any dependencies).

* Add flake support.

* Add copious comments.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-10-02 11:47:16 +03:00
Michael Livshin
839d8f40e6 build: allow setting SCYLLA_RELEASE from outside
The extant logic for deriving the value of SCYLLA_RELEASE from the
source tree has those assumptions:

* The tree being built includes Git metadata.

* The value of `date` is trustworthy and interesting.

* There are no uncommitted changes (those relevant to building,
  anyway).

The above assumptions are either irrelevant or problematic in pure
build environments (such as the sandbox set up by `nix-build`):

* Pure builds use cleaned-up sources with all timestamps reset to Unix
  time 0.  Those cleaned-up sources are saved (in the Nix store, for
  example) and content-hashed, so leaving the (possibly huge) Git
  metadata increases the time to copy the sources and wastes disk
  space (in fact, Nix in flake mode strips `.git` unconditionally).

* Pure builds run in a sandbox where time is, likewise, reset to Unix
  time 0, so the output of `date` is neither informative nor useful.

Now, the only build step that uses Git metadata in the first place is
the SCYLLA_RELEASE value derivation logic.  So, essentially, answering
the question "is the Git metadata needed to build Scylla" is a matter
of definition, and is up to us.  If we elect to ignore Git metadata
and current time, we can derive SCYLLA_RELEASE value from the content
hash of the cleaned-up tree, regardless of the way that tree was
arrived at.

This change makes it possible to skip the derivation of SCYLLA_RELEASE
value from Git metadata and current time by way of setting
SCYLLA_RELEASE in the environment.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-10-02 11:47:16 +03:00
Avi Kivity
17b1cb4434 dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
Place it along the other parameters.
2022-09-30 22:17:37 +03:00
Avi Kivity
ecf30ee469 dirty_memory_manager: simplify region_group::update()
We notice there are two separate conditions controlling a call to
a single outcome, notify_pressure_relief(). Merge them into a single
boolean variable.
2022-09-30 22:15:45 +03:00
Avi Kivity
230fff299a dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
It is trivial.
2022-09-30 22:11:01 +03:00
Avi Kivity
12b81173b9 dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
Remove synthetic "rg" local.
2022-09-30 22:09:09 +03:00
Avi Kivity
e1bad8e883 dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
It started life as something shared between memory_hard_limit and
region_group, but now that they are back being the same thing, we
can make it a member again.
2022-09-30 22:04:26 +03:00
Avi Kivity
6b21c10e9e dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
It is now only accessed from within the class, so the
accessors don't help anything.
2022-09-30 21:59:46 +03:00
Avi Kivity
6a02bb7c2b dirty_memory_manager: merge memory_hard_limit into region_group
The two classes always have a 1:1 or 0:1 relationship, and
so we can just move all the members of memory_hard_limit
into region_group, with the functions that track the relationship
(memory_hard_limit::{add,del}()) removed.

The 0:1 relationship is maintained by initializing the
hard limit parameter with std::numeric_limits<size_t>::max().
The _hard_total_memory variable is always checked if it is
greater than this parameter in order to do anything, and
with this default it can never be.
2022-09-30 21:59:38 +03:00
Avi Kivity
45ab24e43d dirty_memory_manager: rename members in memory_hard_limit
In preparation for merging memory_hard_limit into region_group,
disambiguate similarly named members by adding the word "hard" in
random places.

memory_hard_limit and region_group are candidates for merging
because they constantly reference each other, and memory_hard_limit
does very little by itself.
2022-09-30 21:47:33 +03:00
Avi Kivity
aca96c4103 readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation 2022-09-30 19:19:51 +03:00
Avi Kivity
b08196f3b3 readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine
do_full_buffer() is an eclectic mix of coroutines and continuations.
That makes it hard to follow what is running sequentially and
concurrently.

Convert it into a pure coroutine by changing internal continuations
to lambda coroutines. These lambda coroutines are guarded with
seastar::coroutine::lambda. Furthermore, a future that is co_awaited
is converted to immediate co_await (without an intermediate future),
since seastar::coroutine::lambda only works if the coroutine is awaited
in the same statement it is defined on.
2022-09-30 19:19:48 +03:00
Kamil Braun
b2cf610567 test/pylib: scylla_cluster: improve cluster printing
Print the cluster name and stopped servers in addition to the running
servers.

Fix a logging call which tried to print a server in place of a cluster
and even at that it failed (the server didn't have a hostname yet so it
printed as an empty string). Add another logging call.
2022-09-30 17:00:05 +02:00
Kamil Braun
05ed3769dd test/pylib: don't pass test_case_name to after-test endpoint
It's redundant now, the manager tracks the current test case using
before-test endpoint calls.
2022-09-30 16:41:45 +02:00
Kamil Braun
dc6f37b7f7 test/pylib: scylla_cluster: track current test case name and print it
Use `_before_test` calls to track the current test case name.
Concatenate it with the unique test name like this:
`test_topology.1::test_add_server_add_column`, and print it
instead of the test case name.
2022-09-30 16:38:35 +02:00
Kamil Braun
5be818d73b test.py: pass the unique test name (e.g. test_topology.1) to cluster manager
This helps us distinguish the different repeats of a test in logs.
Rename the variable accordingly in `ScyllaClusterManager`.
2022-09-30 16:24:10 +02:00
Kamil Braun
fde4642472 test/pylib: scylla_cluster: pass the test case name to before_test
We pass the test case name to `after_test` - so make it consistent.
Arguably, the test case name is more useful (as it's more precise) than
the test name.
2022-09-30 16:17:59 +02:00
Kamil Braun
43d8b4a214 test/pylib: use "test_case_name" variable name when talking about test cases
Distinguish "test name" (e.g. `test_topology`) from "test case name"
(e.g. `test_add_server_add_column` - a test case inside
`test_topology`).
2022-09-30 16:15:48 +02:00
Botond Dénes
060dda8e00 Merge 'Reduce dependencies on large data handler header' from Benny Halevy
Reduce the false dependencies on db/large_data_handler.hh by
not including it from commonly used header files, and rather including
it only in the source files that actually need it.

The is in preparation for https://github.com/scylladb/scylladb/issues/11449

Closes #11654

* github.com:scylladb/scylladb:
  test: lib: do not include db/large_data_handler.hh in test_service.hh
  test: lib: move sstable test_env::impl ctor out of line
  sstables: do not include db/large_data_handler.hh in sstables.hh
  api/column_family: add include db/system_keyspace.hh
2022-09-30 13:27:38 +03:00
Tomasz Grabiec
5268f0f837 test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone
The generator was first setting the marker then applied tombstones.

The marker was set like this:

  row.marker() = random_row_marker();

Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.

However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.

This could generate rows with markers uncompacted with shadowable tombstones.

This broken row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.

Fix by making sure there are no key clashes.

Closes #11663
2022-09-30 11:27:01 +03:00
Kamil Braun
1793d43b15 test/pylib: scylla_cluster: mark server_remove as not implemented
The `server_remove` function did a very weird thing: it shut down a
server and made the framework 'forget' about it. From the point of view
of the Scylla cluster and the driver the server was still there.

Replace the function's body with `raise NotImplementedError`. In the
future it can be replaced with an implementation that calls
`removenode` on the Scylla cluster.

Remove `test_remove_server_add_column` from `test_topology`. It
effectively does the same thing as `test_stop_server_add_column`, except
that the framework also 'forgets' about the stopped server. This could
lead to weird situations because the forgotten server's IP could be
reused in another test that was running concurrently with this test.

Closes #11657
2022-09-29 21:03:18 +03:00
Pavel Emelyanov
6a5b0d6c70 table: Handle storage_io_error's ENOSPC when flushing
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made memtable flushing code stand ENOSPC and continue flusing again
in the hope that the node administrator would provide some free space.

However, it looks like the IO code may report back ENOSPC with some
exception type this code doesn't expect. This patch tries to fix it

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-29 19:16:30 +03:00
Pavel Emelyanov
826244084e table: Rewrap retry loop
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-29 19:14:46 +03:00
Benny Halevy
776b009c0f test: lib: do not include db/large_data_handler.hh in test_service.hh
It was needed for defining and referencing nop_lp_handler
and in sstable_3_x_test for testing the large_data_handler.

Remove the include from the commonly used header file
to reduce the false dependencies on large_data_handler.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 18:36:16 +03:00
Benny Halevy
678d88576b test: lib: move sstable test_env::impl ctor out of line
To prepare for removing the include of db/large_data_handler.hh
from test/lib/test_services.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 18:35:12 +03:00
Botond Dénes
ad04f200d3 Merge 'database: automatically take snapshot of base table views' from Benny Halevy
The logic to reject explicit snapshot of views/indexes was improved in aa127a2dbb. However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.

This is implemented in this patch.

The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.

Fixes #11612

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11616

* github.com:scylladb/scylladb:
  database: automatically take snapshot of base table views
  api: storage_service: reject snapshot of views in api layer
2022-09-29 13:33:31 +03:00
Benny Halevy
ae7fd1c7b2 sstables: do not include db/large_data_handler.hh in sstables.hh
Reduce dependencies by only forward-declaring
class db::large_data_handler in sstables.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 12:42:58 +03:00
Benny Halevy
fb7e55b0a8 api/column_family: add include db/system_keyspace.hh
For db::system_keyspace::load_view_build_progress that currently
indirectly satisfied via sstables/sstables.hh ->
db/large_data_handler.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 12:42:54 +03:00
Nadav Har'El
4a7794fb64 alternator: better error message when adding a GSI to an existing table
Due to issue #11567, Alternator do not yet support adding a GSI to an
existing table via UpdateTable with the GlobalSecondaryIndexUpdates
parameter.

However, currently, we print a misleading error message in this case,
complaining about the AttributeDefinitions parameter. This parameter
is also required with GlobalSecondaryIndexUpdates, but it's not the
main problem, and the user is likely to be confused why the error message
points to that specific paramter and what it means that this parameter
is claimed to be "not supported" (while it is supported, in CreateTable).
With this patch, we report that GlobalSecondaryIndexUpdates is not
supported.

This patch does not fix the unsupported feature - it just improves
the error message saying that it's not supported.

Refs #11567

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11650
2022-09-29 09:00:31 +03:00
Asias He
c194c811df repair: Yield in repair_service::do_decommission_removenode_with_repair
When walking through the ranges, we should yield to prevent stalls. We
do similar yield in other node operations.

Fix a stall in 5.1.dev.20220724.f46b207472a3 with build-id
d947aaccafa94647f71c1c79326eb88840c5b6d2

```
!INFO | scylla[6551]: Reactor stalled for 10 ms on shard 0. Backtrace:
0x4bbb9d2 0x4bba630 0x4bbb8e0 0x7fd365262a1f 0x2face49 0x2f5caff
0x36ca29f 0x36c89c3 0x4e3a0e1
````

Fixes #11146

Closes #11160
2022-09-28 18:21:35 +03:00
Avi Kivity
cf3830a249 Merge 'Add support for TRUNCATE USING TIMEOUT' from Benny Halevy
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.

To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.

The latter is no longer derived from raw::cf_statement,
and just stores a schema_ptr to get to the keyspace and column_family.

`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.

Fixes #11408

Also, update docs/cql/ddl.rst truncate-statement section respectively.

Closes #11409

* github.com:scylladb/scylladb:
  docs: cql-extensions: add TRUNCATE to USING TIMEOUT section.
  docs: cql: ddl: add support for TRUNCATE USING TIMEOUT
  cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT
  cql3: selectStatement: restrict to USING TIMEOUT in grammar
  cql3: deleteStatement: restrict to USING TIMEOUT|TIMESTAMP in grammar
2022-09-28 18:19:03 +03:00
Avi Kivity
19374779bb Merge 'Fix large data warning and docs' from Benny Halevy
The series contains fixes for system.large_* log warning and respective documentation.
This prepares the way for adding a new system.large_collections table (See #11449):

Fixes #11620
Fixes #11621
Fixes #11622

the respective fixes should be backported to different release branches, based on the respective patches they depend on (mentioned in each issue).

Closes #11623

* github.com:scylladb/scylladb:
  docs: adjust to sstable base name
  docs: large-partition-table: adjust for additional rows column
  docs: debugging-large-partition: update log warning example
  db/large_data_handler: print static cell/collection description in log warning
  db/large_data_handler: separate pk and ck strings in log warning with delimiter
2022-09-28 17:52:23 +03:00
Nadav Har'El
de1bc147bc Merge 'test.py: cleanups in topology test suites' from Kamil Braun
Fix the type of `create_server`, rename `topology_for_class` to `get_cluster_factory`, simplify the suite definitions and parameters passed to `get_cluster_factory`

Closes #11590

* github.com:scylladb/scylladb:
  test.py: replace `topology` with `cluster_size` in Topology tests
  test.py: rename `topology_for_class` to `get_cluster_factory`
  test/pylib: ScyllaCluster: fix create_server parameter type
2022-09-28 15:19:54 +03:00
Kamil Braun
1bcc28b48b test/topology_raft_disabled: reenable test_raft_upgrade
The test was disabled due to a bug in the Python driver which caused the
driver not to reconnect after a node was restarted (see
scylladb/python-driver#170).

Introduce a workaround for that bug: we simply create a new driver
session after restarting the nodes. Reenable the test.

Closes #11641
2022-09-28 15:13:42 +03:00
Mikołaj Grzebieluch
be8fcba8c1 raft: broadcast_tables: add support for bind variables
Extended the queries language to support bind variables which are bound in the
execution stage, before creating a raft command.

Adjusted `test_broadcast_tables.py` to prepare statements at the beginning of the test.

Fixed a small bug in `strongly_consistent_modification_statement::check_access`.

Closes #11525
2022-09-28 09:54:59 +03:00
Alejo Sanchez
02933c9b82 test.py: close aiohttp session for topology tests
Close the aiohttp ClientSession after pytest session finishes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11648
2022-09-27 18:09:08 +02:00
Kamil Braun
82481ae31b Merge 'raft server, log size limit in bytes' from Gusev Petr
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is greater than
 max_log_size - (size of trailing entries).

Closes #11397

* github.com:scylladb/scylladb:
  raft replication_test, make backpressure test to do actual backpressure
  raft server, shrink_to_fit on log truncation
  raft server, release memory if add_entry throws
  raft server, log size limit in bytes
2022-09-27 14:25:08 +02:00
Kamil Braun
ed67f0e267 Merge 'test.py: fix topology init error handling' from Alecco
When there are errors starting the first cluster(s) the logs of the server logs are needed. So move `.start()` to the `try` block in `test.py` (out of `asynccontextmanager`).

While there, make `ScyllaClusterManager.start()` idempotent.

Closes #11594

* github.com:scylladb/scylladb:
  test.py: fix ScyllaClusterManager start/stop
  test.py: fix topology init error handling
2022-09-27 11:36:07 +02:00
Petr Gusev
bc50b7407f raft replication_test, make backpressure test to do actual backpressure
Before this patch this test didn't actually experience any backpressure since all the commands were executed sequentially.
2022-09-27 12:04:14 +04:00
Petr Gusev
cbfe033786 raft server, shrink_to_fit on log truncation
We don't want to keep memory we don't use, shrink_to_fit guarantees that.

In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.
2022-09-27 12:02:36 +04:00
Petr Gusev
b34dfed307 raft server, release memory if add_entry throws
We consume memory from semaphore in add_entry_on_leader, but never release it if add_entry throws.
2022-09-27 12:02:34 +04:00
Benny Halevy
b178813cba docs: cql-extensions: add TRUNCATE to USING TIMEOUT section.
List the queries that support the TIMEOUT parameter.

Mention the newly added support for TRUNCATE
USING TIMEOUT.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
b0bad0b153 docs: cql: ddl: add support for TRUNCATE USING TIMEOUT
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
64140ccf05 cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.

To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.

The latter, statements::truncate_statement, is no longer derived
from raw::cf_statement, and just stores a schema_ptr to get to the
keyspace and column_family names.

`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.

Fixes #11408

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
27d3e48005 cql3: selectStatement: restrict to USING TIMEOUT in grammar
It is preferred to reject USING TLL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.

test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.

Note that cql3::statements::raw::select_statement validate_attrs
now asserts that the ttl or the timestamp attributes aren't set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
0728d33d5f cql3: deleteStatement: restrict to USING TIMEOUT|TIMESTAMP in grammar
It is preferred to reject USING TLL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.

test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.

Note that now delete_statement ctor asserts that the ttl
attribute is not set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Kamil Braun
696bdb2de7 test.py: replace topology with cluster_size in Topology tests
First, a reminder of a few basic concepts in Scylla:
- "topology" is a mapping: for each node, its DC and Rack.
- "replication strategy" is a method of calculating replica sets in
  a cluster. It is not a cluster-global property; each keyspace can have
  a different replication strategy. A cluster may have multiple
  keyspaces.
- "cluster size" is the number of nodes in a cluster.

Replication strategy is orthogonal to topology. Cluster size can be
derived from topology and is also orthogonal to replication strategy.

test.py was confusing the three concepts together. For some reason,
Topology suites were specifying a "topology" parameter which contained
replication strategy details - having nothing to do with topology. Also
it's unclear why a test suite would specify anything to do with
replication strategies - after all, a test may create keyspaces with
different replication strategies, and a suite may contain multiple
different tests.

Get rid of the "topology" parameter, replace it with a simple
"cluster_size". In the future we may re-introduce it when we actually
implement the possibility to start clusters with custom topologies
(which involves configuring the snitch etc.) Simplify the test.py code.
2022-09-26 15:17:50 +02:00
Botond Dénes
895522db23 mutation_fragment_stream_validator: make interface more robust
The validator has several API families with increasing amount of detail.
E.g. there is an `operator()(mutation_fragment_v2::kind)` and an
overload also taking a position. These different API families
currently cannot be mixed. If one uses one overload-set, one has to
stick with it, not doing so will generate false-positive failures.
This is hard to explain in documentation to users (provided they even
read it). Instead, just make the validator robust enough such that the
different API subsets can be mixed in any order. The validator will try
to make most of the situation and validate as much as possible.
Behind the scenes all the different validation methods are consolidated
into just two: one for the partition level, the other for the
intra-partition level. All the different overloads just call these
methods passing as much information as they have.
A test is also added to make sure this works.
2022-09-26 13:26:26 +03:00
Kamil Braun
0725ab3a3e test.py: rename topology_for_class to get_cluster_factory
The previous name had nothing to do with what the function calculated
and returned (it returned a `create_cluster` function; the standard name
for a function that constructs objects would be 'factory', so
`get_cluster_factory` is an appropriate name for a function that returns
cluster factories).
2022-09-26 11:45:44 +02:00
Kamil Braun
06cc4f9259 test/pylib: ScyllaCluster: fix create_server parameter type
The only usage of `ScyllaCluster` constructor passed a `create_server`
function which expected a `List[str]` for the second parameter, while
the constructor specified that the function should expect an
`Optional[List[str]]`. There was no reason for the latter, we can easily
fix this type error.

Also give a type hint for `create_cluster` function in
`PythonTestSuite.topology_for_class`. This is actually what catched the
type error.
2022-09-26 11:45:44 +02:00
Petr Gusev
27e60ecbf4 raft server, log size limit in bytes
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
2022-09-26 13:10:10 +04:00
Benny Halevy
d32c497cd9 database: automatically take snapshot of base table views
The logic to reject explicit snapshot of views/indexes
was improved in aa127a2dbb.
However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.

This is implemented in this patch.

The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.

Fixes #11612

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 11:02:54 +03:00
Benny Halevy
55b0b8fe2c api: storage_service: reject snapshot of views in api layer
Rather than pushing the check to
`snapshot_ctl::take_column_family_snapshot`, just check
that explcitly when taking a snapshot of a particular
table by name over the api.

Other paths that call snapshot_ctl::take_column_family_snapshot
are internal and use it to snap views already.

With that, we can get rid of the allow_view_snapshots flag
that was introduced in aab4cd850c.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 10:44:56 +03:00
Botond Dénes
4d017b6d7e mutation_fragment_stream_validator: add reset() to validating filter
Allow the high level filtering validator to be reset() to a certain
position, so it can be used in situations where the consumption is not
continuous (fast-forwarding or paging).
2022-09-26 10:17:28 +03:00
Botond Dénes
a8cbf66573 mutation_fragment_stream_validator: move active tomsbtone validation into low level validator
Currently the active range tombstone change is validated in the high
level `mutation_fragment_stream_validating_stream`, meaning that users of
the low-level `mutation_fragment_stream_validator` don't benefit from
checking that tombstones are properly closed.
This patch moves the validation down to the low-level validator (which
is what the high-level one uses under the hood too), and requires all
users to pass information about changes to the active tombstone for each
fragment.
2022-09-26 10:17:27 +03:00
Nadav Har'El
868a884b79 test/cql-pytest: add reproducer for ignored IS NOT NULL
This test reproduces issue #10365: It shows that although "IS NOT NULL" is
not allowed in regular SELECT filters, in a materialized view it is allowed,
even for non-key columns - but then outright ignored and does not actually
filter out anything - a fact which already surprised several users.

The test also fails on Cassandra - it also wrongly allows IS NOT NULL
on the non-key columns but then ignores this in the filter. So the test
is marked with both xfail (known to fail on Scylla) and cassandra_bug
(fails on Cassandra because of what we consider to be a Cassandra bug).

Refs #10365
Refs #11606

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11615
2022-09-26 09:02:08 +03:00
Anna Stuchlik
c5285bcb14 doc: remove the section about updating OS packages during upgrade from upgrade guides for Ubunut and Debian (from 4.5 to 4.6)
Closes #11629
2022-09-26 08:04:02 +03:00
Avi Kivity
ad2f1dc704 Merge 'Avoid default initialization of token_metadata and topology when not needed' from Pavel Emelyanov
The goal is not to default initialize an object when its fields are about to be immediately overwritten by the consecutive code.

Closes #11619

* github.com:scylladb/scylladb:
  replication_strategy: Construct temp tokens in place
  topology: Define copy-sonctructor with init-lists
2022-09-25 18:08:42 +03:00
Jan Ciolek
ac152af88c expression: Add for_each_boolean factor
boolean_factors is a function that takes an expression
and extracts all children of the top level conjunction.

The problem is that it returns a vector<expression>,
which is inefficent.

Sometimes we would like to iterate over all boolean
factors without allocations. for_each_boolean_factor
is implemented for this purpose.

boolean_factors() can be implemented using
for_each_boolean_factor, so it's done to
reduce code duplication.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-09-25 16:34:22 +03:00
Avi Kivity
2a538f5543 Merge 'Cut one of snitch->gossiper links' from Pavel Emelyanov
Snitch uses gossiper for several reasons, one of is to re-gossip the topology-related app states when property-file snitch config changes. This set cuts this link by moving re-gossiping into the existing storage_service::snitch_reconfigured() subscription. Since initial snitch state gossiping happens in storage service as well, this change is not unreasonable.

Closes #11630

* github.com:scylladb/scylladb:
  storage_service: Re-gossiping snitch data in reconfiguration callback
  storage_service: Coroutinize snitch_reconfigured()
  storage_service: Indentation fix after previous patch
  storage_service: Reshard to shard-0 earlier
  storage_service: Refactor snitch reconfigured kick
2022-09-25 16:08:48 +03:00
Benny Halevy
a1adbf1f59 docs: adjust to sstable base name
Since 244df07771 (scylla 5.1),
only the sstable basename is kept in the large_* system tables.
The base path can be determined from the keyspace and
table name.

Fixes #11621

Adjust the examples in documentation respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
33924201cc docs: large-partition-table: adjust for additional rows column
Since a7511cf600 (scylla 5.0),
sstables containing partitions with too many rows are recorded in system.large_partitions.

Adjust the doc respectively.

Fixes #11622

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
92ff17c6e3 docs: debugging-large-partition: update log warning example
The log warning format has changed since f3089bf3d1
and was fixed in the previous patch to include
a delimiter between the partition key, clustering key, and
column name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
fcbbc3eb9c db/large_data_handler: print static cell/collection description in log warning
When warning about a large cell/collection in a static row,
print that fact in the log warning to make it clearer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:37:42 +03:00
Benny Halevy
4670829502 db/large_data_handler: separate pk and ck strings in log warning with delimiter
Currently (since f3089bf3d1),
when printing a warning to the log about large rows and/or cells
the clustering key string is concatenated to the partition key string,
rendering the warning confsing and much less useful.

This patch adds a '/' delimiter to separate the fields,
and also uses one to separate the clustering key from the column name
for large cells.  In case of a static cell, the clustering key is null
hence the warning will look like: `pk//column`.

This patch does NOT change anything in the large_* system
table schema or contents.  It changes only the log warning format
that need not be backward compatible.

Fixes #11620

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:36:41 +03:00
Pavel Emelyanov
47958a4b37 storage_service: Re-gossiping snitch data in reconfiguration callback
Nowadays it's done inside snitch, and snitch needs to carry gossiper
refernece for that. There's an ongoing effort in de-globalizing snitch
and fixing its dependencies. This patch cuts this snitch->gossiper link
to facilitate the mentioned effort.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
932566d448 storage_service: Coroutinize snitch_reconfigured()
Next patch will add more sleeping code to it and it's simpler if the new
call is co_await-ed rather than .then()-ed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
7fee98cad0 storage_service: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
3d4ea2c628 storage_service: Reshard to shard-0 earlier
It makes next patch shorter and nicer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
11b79f9f80 storage_service: Refactor snitch reconfigured kick
The snitch_reconfigured calls update_topology with local node bcast
address argument. Things get simpler if the callee gets the address
itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:43 +03:00
Anna Stuchlik
46f0e99884 doc: add the link to the new Troubleshooting section and replace Scylla with ScyllaDB 2022-09-23 11:46:15 +02:00
Anna Stuchlik
af2a85b191 doc: add the new page to the toctree 2022-09-23 11:37:38 +02:00
Anna Stuchlik
b034e2856e doc: add a troubleshooting article about the missing configuration files 2022-09-23 11:17:18 +02:00
Botond Dénes
b9d55ee02f Merge 'Add cassandra functional - show warn/err when tombstone threshold reached.' from Taras Borodin
Add cassandra functional - show warn/err when tombstone_warn_threshold/tombstone_failure_threshold reached on select, by partitions. Propagate raw query_string from coordinator to replicas.

Closes #11356

* github.com:scylladb/scylladb:
  add utf8:validate to operator<< partition_key with_schema.
  Show warn message if `tombstone_warn_threshold` reached on querier.
2022-09-23 05:53:47 +03:00
Pavel Emelyanov
9e7407ff91 replication_strategy: Construct temp tokens in place
Otherwise, the token_metadata object is default-initialized, then it's
move-assigned from another object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-22 19:19:32 +03:00
Pavel Emelyanov
d540af2cb0 topology: Define copy-sonctructor with init-lists
Otherwise the topology is default-constructed, then its fields
are copy-assigned with the data from the copy-from reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-22 19:18:58 +03:00
Tomasz Grabiec
ccbfe2ef0d Merge 'Fix invalid mutation fragment stream issues' from Botond Dénes
Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own.
The first two of these issues seems benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumer of said streams may swallow the invalid stream without problem now but any future change may cause it to enter into a corrupt state.
The last one is a non-benign problem (again because the consumer reacts badly already) causing problems when building query results for range scans.

Closes #11604

* github.com:scylladb/scylladb:
  shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
  readers/mutation_readers: compacting_reader: remember injected partition-end
  db/view: view_builder::execute(): only inject partition-start if needed
2022-09-22 17:57:27 +02:00
Taras Borodin
c155ae1182 add utf8:validate to operator<< partition_key with_schema. 2022-09-22 16:42:31 +03:00
TarasBor
1f4a93da78 Show warn message if tombstone_warn_threshold reached on querier.
When querier read page with tombstones more than `tombstone_warn_threshold` limit - warning message appeared in logs.
If `tombstone_warn_threshold:0` feature disabled.

Refs scylladb#11410
2022-09-22 16:42:31 +03:00
Avi Kivity
0cbaef31c1 dirty_memory_manager: fold do_update() into region_group::update()
There is just one caller, and folding the two functions enables
simplification.
2022-09-22 15:51:19 +03:00
Avi Kivity
8672f2248c dirty_memory_manager: simplify memory_hard_limit's do_update
do_update() has an output parameter (top_relief) which can either
be set to an input parameter or left alone. Simplify it by returning
bool and letting the caller reuse the parameter's value instead.
2022-09-22 15:50:48 +03:00
Avi Kivity
1858268377 dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
They are write-only.

This corresponds to the fact that memory_hard_limit does not do
flushing (which is initiated by crossing the soft limit), it only
blocks new allocations.
2022-09-22 14:59:38 +03:00
Asias He
9ed401c4b2 streaming: Add finished percentage metrics for node ops using streaming
We have added the finished percentage for repair based node operations.

This patch adds the finished percentage for node ops using the old
streaming.

Example output:

scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000

In addition to the metrics, log shows the percentage is added.

[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646

Fixes #11600

Closes #11601
2022-09-22 14:19:34 +03:00
Avi Kivity
8369741063 dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
We made this function a template to prevent code duplication, but now
memory_hard_limit was sufficiently simplified so that the implementations
can start to diverge.
2022-09-22 14:16:43 +03:00
Avi Kivity
76ced5a60c dirty_memory_manager: adjust soft_limit threshold check
Use `>` rather than `>=` to match the hard limit check. This will
aid simplification, since for memory_hard_limit the soft and hard limits
are identical.

This should not cause any material behavior change, we're not sensitive
to single byte accounting. Typical limits are on the order of gigabytes.
2022-09-22 14:06:01 +03:00
Avi Kivity
b9eb26cd77 dirty_memory_manager: drop memory_hard_limit::_name
It's write-only.
2022-09-22 14:01:57 +03:00
Avi Kivity
c64fb66cc3 dirty_memory_manager: simplify memory_hard_limit configuration
We observe that memory_hard_limit's reclaim_config is only ever
initialized as default, or with just the hard_limit parameter.
Since soft_limit defaults to hard_limit, we can collapse the two
into a limit. The reclaim callbacks are always left as the default
no-op functions, so we can eliminate them too.

This fits with memory_hard_limit only being responsible for the hard
limit, and for it not having any memtables to reclaim on its own.
2022-09-22 13:56:59 +03:00
Avi Kivity
2f907dc47d dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
region_group_reclaimer is used to initialize (by reference) instances of
memory_hard_limit and region_group. Now that it is a final class, we
can fold it into its users by pasting its contents into those users,
and using the initializer (reclaim_config) to initialize the users. Note
there is a 1:1 relationship between a region_group_reclaimer instance
and a {memory_hard_limit,region_group} instance.

It may seem like code duplication to paste the contents of one class into
two, but the two classes use region_group_reclaimer differently, and most
of the code is just used to glue different classes together, so the
next patches will be able to get rid of much of it.

Some notes:
 - no_reclaimer was replaced by a default reclaim_config, as that's how
   no_reclaimer was initialized
 - all members were added as private, except when a caller required one
   to be public
 - an under_presssure() member already existed, forwarding to the reclaimer;
   this was just removed.
2022-09-22 13:56:59 +03:00
Avi Kivity
d8f857e74b dirty_memory_manager: stop inheriting from region_group_reclaimer
This inheritance makes it harder to get rid of the class. Since
there are no longer any virtual functions in the class (apart from
the destructor), we can just convert it to a data member. In a few
places, we need forwarding functions to make formerly-inherited functions
visible to outside callers.

The virtual destructor is removed and the class is marked final to
verify it is no longer a base class anywhere.
2022-09-22 13:56:59 +03:00
Avi Kivity
26f3a123a5 dirty_memory_manager: test: unwrap region_group_reclaimer
In one test, region_group_reclaimer is wrapped in another class just
to toggle a bool, but with the new callbacks it's easy to just use
a bool instead.
2022-09-22 13:56:59 +03:00
Avi Kivity
1d3508e02c dirty_memory_manager: change region_group_reclaimer configuration to a struct
It's just so much nicer.

The "threshold" limit was renamed to "hard_limit" to contrast it with
"soft_limit" (in fact threshold is a good name for soft_limit, since
it's a point where the behavior begins to change, but that's too much
of a change).
2022-09-22 13:56:59 +03:00
Avi Kivity
2c54c7d51e dirty_memory_manager: convert region_group_reclaimer to callbacks
region_group_reclaimer is partially policy (deciding when to reclaim)
and partially mechanism (implementing reclaim via virtual functions).

Move the mechanism to callbacks. This will make it easy to fold the
policy part into region_group and memory_hard_limit. This folding is
expected to simplify things since most of region_group_reclaimer is
cross-class communication.
2022-09-22 13:56:59 +03:00
Avi Kivity
8fa0652e68 dirty_memory_manager: consolidate region_group_reclaimer constructors
Delegate to other constructors rather than repeating the code. Doesn't
help much here, but simplifies the next patch.
2022-09-22 13:56:59 +03:00
Avi Kivity
5efbfa4cab dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
It clashes with region_group_reclaimer::notify_relief, which does something
different. Since we plan to merge region_group_reclaimer into
memory_hard_limit and region_group (this can simplify the code), we
need to avoid duplicate function names.
2022-09-22 13:56:59 +03:00
Avi Kivity
a72ac14154 dirty_memory_manager: drop unused parameter to memory_hard_limit constructor 2022-09-22 13:56:59 +03:00
Avi Kivity
fca5689052 dirty_memory_manager: drop memory_hard_limit::shutdown()
It is empty.
2022-09-22 13:56:59 +03:00
Avi Kivity
152136630c dirty_memory_manager: split region_group hierarchy into separate classes
Currently, region_group forms a hierarchy. Originally it was a tree,
but previous work whittled it down to a parent-child relationship
(with a single, possible optional parent, and a single child).

The actual behavior of the parent and child are very different, so
it makes sense to split them. The main difference is that the parent
does not contain any regions (memtables), but the child does.

This patch mechanically splits the class. The parent is named
memory_hard_limit (reflecting its role to prevent lsa allocation
above the memtable configured hard limit). The child is still named
region_group.

Details of the transformation:
 - each function or data member in region_group is either moved to
   memory_hard_limit, duplicated in memory_hard_limit, or left in
   region_group.
 - the _regions and _blocked_requests members, which were always
   empty in the parent, were not duplicated. Any member that only accessed
   them was similarly left alone.
 - the "no_reclaimer" static member which was only used in the parent
   was moved there. Similarly the constructor which accepted it
   was moved.
 - _child was moved to the parent, and _parent was kept in the child
   (more or less the defining change of the split) Similarly
   add(region_group*) and del(region_group*) (which manage _child) were moved.
 - do_for_each_parent(), which iterated to the top of the tree, was removed
   and its callers manually unroll the loop. For the parent, this is just
   a single iteration (since we're iterating towards the root), for the child,
   this can be two iterations, but the second one is usually simpler since
   the parent has many members removed.
 - do_update(), introduced in the previous patch, was made a template that
   can act on either the parent or the child. It will be further simplified
   later.
 - some tests that check now-impossible topologies were removed.
 - the parent's shutdown() is trivial since it has no _blocked_requests,
   but it was kept to reduce churn in the callers.
2022-09-22 13:56:59 +03:00
Avi Kivity
009bd63217 dirty_memory_manager: extract code block from region_group::update
A mechanical transformation intended to allow reuse later. The function
doesn't really deserve to exist on its own, so it will be swallowed back
by its callers later.
2022-09-22 13:56:59 +03:00
Avi Kivity
34d5322368 dirty_memory_manager: move more allocation_queue functions out of region_group
More mechanical changes, reducing churn for later patches.
2022-09-22 13:56:59 +03:00
Avi Kivity
4bc2638cf9 dirty_memory_manager: move some allocation queue related function definitions outside class scope
It's easier to move them to a new owner (allocation_queue) if they are
not defined in the class.
2022-09-22 13:56:59 +03:00
Avi Kivity
71493c2539 dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
region_group currently fulfills two roles: in one role, when instantiated
as dirty_memory_manager::_virtual_region_group, it is responsible
for holding functions that allocate memtable memory (writes) and only
allowing them to run when enough dirty memory has been flushed from other
memtables. The other role, when instantiated as
dirty_memory_manager::_real_region_group, is to provide a hard stop when
the total amount of dirty memory exceeds the limit, since the other limit
is only estimated.

We want to simplify the whole thing, which means not using the same class
for two different roles (or rather, we can use it for both roles if we
simplify the internals significantly).

As a first step towards clarifying what functionality is used in what
role, move some classes related to holding allocating functions to a new
class allocation_queue. We will gradually move move content there, reducing
the amount of role confusion in region_group.

Type aliases are added to reduce churn.
2022-09-22 13:56:59 +03:00
Avi Kivity
d21d2cdb3e dirty_memory_manager: remove support for multiple subgroups
We only have one parent/child relationship in the region group
hierarchy, so support for more is unneeded complexity. Replace
the subgroup vector with a single pointer, and delete a test
for the removed functionality.
2022-09-22 13:56:59 +03:00
Botond Dénes
0ccb23d02b shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before done and the multishard reader might see the
half-copied buffer and consider the reader done (because
`_end_of_stream` is already set) resulting in the dropping the remaining
part of the buffer and in an invalid stream if the last copied fragment
wasn't a partition-end.

Fixes: #11561
2022-09-22 13:54:36 +03:00
Botond Dénes
16a0025dc3 readers/mutation_readers: compacting_reader: remember injected partition-end
Currently injecting a partition-end doesn't update
`_last_uncompacted_kind`, which will allow for a subsequent
`next_partition()` call to trigger injecting a partition-end, leading to
an invalid mutation fragment stream (partition-end after partition-end).
Fix by changing `_last_uncompacted_kind` to `partition_end` when
injecting a partition-end, making subsequent injection attempts noop.

Fixes: #11608
2022-09-22 13:54:36 +03:00
Botond Dénes
681e6ae77f db/view: view_builder::execute(): only inject partition-start if needed
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the inject conditional on when the reader contains
more data for the partition.

Fixes: #11607
2022-09-22 13:54:36 +03:00
Nadav Har'El
517c1529aa docs: update docs/alternator/getting-started.md
Update several aspects of the alternator/getting-started.md which were
not up-to-date:

* When the documented was written, Alternator was moving quickly so we
  recommended running a nightly version. This is no longer the case, so
  we should recommend running the latest stable build.

* The link to the download link is no longer helpful for getting Docker
  instructions (it shows some generic download options). Instead point to
  our dockerhub page.

* Replace mentions of "Scylla" by the new official name, "ScyllaDB".

* Miscelleneous copy-edits.

Fixes #11218

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11605
2022-09-22 11:08:05 +03:00
Piotr Sarna
481240b8b4 Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).

In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.

I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.

Fixes #11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783

Closes #11384

* github.com:scylladb/scylladb:
  test/alternator: insert test names into Scylla logs
  rest api: add a new /system/log operation
  alternator ttl: log warning if scan took too long.
  alternator,ttl: allow sub-second TTL scanning period, for tests
  test/alternator: skip fewer Alternator TTL tests
  test/alternator: test Alternator TTL metrics
2022-09-22 09:47:50 +02:00
Botond Dénes
ef7471c460 readers/mutation_reader: stream validator: fix log level detection logic
The mutation fragment stream validator filter has a detailed debug log
in its constructor. To avoid putting together this message when the log
level is above debug, it is enclosed in an if, activated when log level
is debug or trace... at least that was intended. Actually the if is
activated when the log level is debug or above (info, warn or error) but
is only actually logged if the log level is exactly debug. Fix the logic
to work as intended.

Closes #11603
2022-09-22 09:41:45 +03:00
Pavel Emelyanov
7ae73c665b gossiper: Remove some dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11599
2022-09-22 06:58:29 +03:00
Pavel Emelyanov
5edeecf39b token_metadata: Provide dc/rack for bootstrapping nodes
The token_metadata::calculate_pending_ranges_for_bootstrap() makes a
clone of itself and adds bootstrapping nodes to the clone to calculate
ranges. Currently added nodes lack the dc/rack which affects the
calculations the bad way.

Unfortunately, the dc/rack for those nodes is not available on topology
(yet) and needs pretty heavy patching to have. Fortunately, the only
caller of this method has gossiper at hand to provide the dc/rack from.

fixes: #11531

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11596
2022-09-22 06:55:52 +03:00
Petr Gusev
210d9dd026 raft: fix snapshots leak
applier_fiber could create multiple snapshots between
io_fiber run. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.

In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.

_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.

The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.

Closes #11530
2022-09-21 12:46:26 +02:00
Kamil Braun
3b096b71c1 test/topology_raft_disabled: disable test_raft_upgrade
For some reason, the test is currently flaky on Jenkins. Apparently the
Python driver does not reconnect to the cluster after the cluster
restarts (well it does, but then it disconnects from one of the nodes
and never reconnects again). This causes the test to hang on "waiting
until driver reconnects to every server" until it times out.

Disable it for now so it doesn't block next promotion.
2022-09-21 12:32:40 +02:00
Nadav Har'El
22bb35e2cb Merge 'doc: update the "Counting all rows in a table is slow" page' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11373

- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.

Closes #11417

* github.com:scylladb/scylladb:
  doc: add a comment to remove the note in version 5.1
  doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB
  doc: add a note to the description of COUNT with a reference to the KB article
  doc: add COUNT to the list of acceptable selectors of the SELECT statement
2022-09-21 12:32:40 +02:00
Alejo Sanchez
510215d79a test.py: fix ScyllaClusterManager start/stop
Check existing is_running member to avoid re-starting.

While there, set it to false after stopping.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-21 11:42:02 +02:00
Alejo Sanchez
933d93d052 test.py: fix topology init error handling
Start ScyllaClusterManager within error handling so the ScyllaCluster
logs are available in case of error starting up.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-21 09:15:25 +02:00
Nadav Har'El
91bccee9be Update tools/java submodule
* tools/java b004da9d1b...5f2b91d774 (1):
  > install.sh is using wrong permissions for install cqlsh files

Fixes #11584
2022-09-20 14:42:34 +03:00
Avi Kivity
2cec417426 Merge 'tools: use the standard allocator' from Botond Dénes
Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to to instruct seastar to *not* configure and use the seastar allocator and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.

Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.

Closes #11510

* github.com:scylladb/scylladb:
  test/boost: add alternative variant of logalloc test
  tools: use standard allocator
  utils/logalloc: add use_standard_allocator_segment_pool_backend()
  utils/logalloc: introduce segment store backend for standard allocator
  utils/logalloc: rebase release segment-store on segment-store-backend
  utils/logalloc: introduce segment_store_backend
  utils/logalloc: push segment alloc/dealloc to segment_store
  test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
2022-09-20 12:59:34 +03:00
Nadav Har'El
4a453c411d Merge 'doc: add the upgrade guide from 5.0 to 5.1' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11376

This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides).

Closes #11577

* github.com:scylladb/scylladb:
  doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1
  doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1
2022-09-20 11:52:59 +03:00
Nadav Har'El
d81bedd3be Merge 'doc: add ScyllaDB image upgrade guides for patch releases' from Anna Stuchlik
This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release:

- ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/
- ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR)

In addition, the section regarding the recommended upgrade procedure has been improved.
Fixes https://github.com/scylladb/scylladb/issues/11450
Fixes https://github.com/scylladb/scylladb/issues/11452

Closes #11460

* github.com:scylladb/scylladb:
  doc: update the commands to upgrade the ScyllaDB image
  doc: fix the filename in the index to resolve the warnings and fix the link
  doc: apply feedback by adding she step fo load the new repo and fixing the links
  doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
  doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
  doc: update the Enterprise guide to include the Enterprise-onlyimage file
  doc: update the image files
  doc: split the upgrade-image file to separate files for Open Source and Enterprise
  doc: clarify the alternative upgrade procedures for the ScyllaDB image
  doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
  doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
2022-09-20 11:51:26 +03:00
Botond Dénes
4ef7b080e3 docs/using-scylla/migrate-scylla.rst: remove link to unirestore
It points to a private scylladb repo, which has no place in user-facing
documentation. For now there is no public replacement, but a similar
functionality is in the works for Scylla Manager.

Fixes: #11573

Closes #11580
2022-09-20 11:46:28 +03:00
Anna Stuchlik
7b2209f291 doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1 2022-09-20 10:42:47 +02:00
Anna Stuchlik
db75adaf9a doc: update the commands to upgrade the ScyllaDB image 2022-09-20 10:36:18 +02:00
Nadav Har'El
4c93a694b7 cql: validate bloom_filter_fp_chance up-front
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.

Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.

This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).

Fixes #11524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11576
2022-09-20 06:18:51 +03:00
Botond Dénes
60991358e8 Merge 'Improvements to test/lib/sstable_utils.hh' from Raphael "Raph" Carvalho
Changes done to avoid pitfalls and fix issues of sstable-related unit tests

Closes #11578

* github.com:scylladb/scylladb:
  test: Make fake sstables implicitly belong to current shard
  test: Make it clearer that sstables::test::set_values() modify data size
2022-09-20 06:14:07 +03:00
Nadav Har'El
a1ff865c77 Merge 'test/topology_raft_disabled: write basic raft upgrade test' from Kamil Braun
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.

It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.

The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).

Closes #11563

* github.com:scylladb/scylladb:
  test/topology_raft_disabled: write basic raft upgrade test
  test: setup logging in topology suites
2022-09-19 20:27:08 +03:00
Alejo Sanchez
087ae521c5 test.py: make client fail if before test check fails
Check if request to server side (test.py) failed and raise if so.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11575
2022-09-19 18:04:07 +02:00
Raphael S. Carvalho
2f52698a26 test: Make fake sstables implicitly belong to current shard
Fake SSTables will be implicitly owned by the shard that created them,
allowing them to be called on procedures that assert the SSTables
are owned by the current shard, like the table's one that rebuilds
the sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-19 12:05:24 -03:00
Raphael S. Carvalho
697f200319 test: Make it clearer that sstables::test::set_values() modify data size
By adding a param with default value, we make it clear in the interface
that the procedure modifies sstable data size.

It can happen one calls this function without noticing it overrides
the data size previously set using a different function.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-19 12:01:24 -03:00
Anna Stuchlik
2513497f9a doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1 2022-09-19 16:06:24 +02:00
Kamil Braun
b770443300 test/topology_raft_disabled: write basic raft upgrade test
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.

It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.

The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
2022-09-19 13:29:35 +02:00
Kamil Braun
fd986bfed1 test: setup logging in topology suites
Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.

Also cleanup the `pytest.ini` files a bit.
2022-09-19 12:23:11 +02:00
Nadav Har'El
711dcd56b6 docs/alternator: refer to the right issue
In compatibility.md where we refer to the missing ability to add a GSI
to an existing table - let's refer to a new issue specifically about this
feature, instead of the old bigger issue about UpdateItem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11568
2022-09-19 11:05:07 +03:00
Piotr Sarna
5597bc8573 Merge 'Alternator: test and fix crashes and errors...
when using ":attrs" attribute' from Nadav Har'El

This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item.  The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.

What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of them fail or even crash Scylla, and then we fix all these
cases.

Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.

Refs #5009.

Closes #11572

* github.com:scylladb/scylladb:
  alternator: fix crashes an errors when using ":attrs" attribute
  alternator: improve tests for reserved attribute name ":attrs"
2022-09-19 09:48:06 +02:00
Nadav Har'El
999ca2d588 alternator: fix crashes an errors when using ":attrs" attribute
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.

The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".

This fix makes the relevant tests, which used to crash or fail, now pass.

This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:11 +03:00
Nadav Har'El
6f8dca3760 alternator: improve tests for reserved attribute name ":attrs"
As explained in issue #5009, Alternator currently forbids the special
attribute name ":attrs", whereas DynamoDB allows any string of approriate
length (including the specific string ":attrs") to be used.

We had only a partial test for this incompatibility, and this patch
improves the testing of this issue. In particular, we were missing a
test for the case that the name ":attrs" was used for a non-key
attribute (we only tested the case it was used as a sort key).

It turns out that Alternator crashes on the new test, when the test tries
to write to a non-key attribute called ":attrs", so we needed to mark
the new test with "skip". Moreover, it turns out that different code paths
handle the attribute name ":attrs" differently, and also crash or fail
in other ways - so we added more than one xfailing and skipped tests
that each fails in a different place (and also a few tests that do pass).

As usual, the new tests we checked to pass on DynamoDB.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:06 +03:00
Botond Dénes
3003f1d747 Merge 'alternator: small documentation and comment fixes' from Nadav Har'El
This tiny series fixes some small error and out-of-date information in Alternator documentation and code comments.

Closes #11547

* github.com:scylladb/scylladb:
  alternator ttl: comment fixes
  docs/alternator: fix mention of old alternator-test directory
2022-09-19 09:27:53 +03:00
Kamil Braun
348582c4c8 test/pylib: pool: make it possible to free up space
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.

This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.

Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
  the borrower is then responsible for cleaning up the object.

Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.

Closes #11558
2022-09-18 12:05:57 +03:00
Botond Dénes
22128977e4 test/boost: add alternative variant of logalloc test
Which intializes LSA with use_standard_allocator_segment_pool_backend()
running the logalloc_test suite on the standard allocator segment pool
backend. To avoid duplicating the test code, the new test-file pulls in
the test code via #include. I'm not proud of it, but it works and we
test LSA with both the debug and standard memory segment stores without
duplicating code.
2022-09-16 14:57:23 +03:00
Botond Dénes
13ace7a05e Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.

This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.

Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not avaiable either.

The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are groupped into several
"indices" and socket is created for it. Thus, the "gossiping" index that
includes non-gossiper verbs will create topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't matchthe local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.

Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"

* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
  messaging_service: Fix gossiper verb group
  messaging_service: Mind the absence of topology data when creating sockets
  messaging_service: Templatize and rename remove_rpc_client_one
2022-09-16 13:27:56 +03:00
Botond Dénes
6a0db84706 tools: use standard allocator
Use the new seastar option to instruct seastar to not initialize and use
the seastar allocator, relying on the standard allocator instead.
Configure LSA with the standard allocator based segment store backend:
* scylla-types reserves 1MB for LSA -- in theory nothing here should use
  LSA, but just in case...
* scylla-sstable reserves 100MB for LSA, to avoid excessive trashing in
  the sstable index caches.

With this, tools now should allocate memory on demand, without reserving
a large chunk of (or all of) the available memory, as regular seastar
apps do.
2022-09-16 13:07:01 +03:00
Botond Dénes
a55903c839 utils/logalloc: add use_standard_allocator_segment_pool_backend()
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
2022-09-16 13:02:40 +03:00
Kamil Braun
595472ac59 Merge 'Don't use qctx in CDC tables quering' from Pavel Emelyanov
There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via it, thus reducing the global qctx usage.

Closes #11557

* github.com:scylladb/scylladb:
  system_keyspace: De-static get_cdc_generation_id()
  system_keyspace: De-static cdc_is_rewritten()
  system_keyspace: De-static cdc_set_rewritten()
  system_keyspace: De-static update_cdc_generation_id()
2022-09-16 11:52:01 +02:00
Kamil Braun
0a6f601996 Merge 'Raft test topology fix request paths and API response handling' from Alecco
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)

Closes #11496

* github.com:scylladb/scylladb:
  test.py: handle duplicate result from driver
  test.py: log server restarts for topology tests
  test.py: log actions for topology tests
  Revert "test.py: restart stopped servers before...
  test.py: ManagerClient API fix return text
  test.py: ManagerClient raise on HTTP != 200
  test.py: ManagerClient fix paths to updated resource
2022-09-16 11:29:10 +02:00
Botond Dénes
c1c74005b7 utils/logalloc: introduce segment store backend for standard allocator
To be used by tools, this store backend is compatible with the standard
allocator as it acquires the memory arena for segments via mmap().
2022-09-16 12:16:57 +03:00
Botond Dénes
d2a7ebbe66 utils/logalloc: rebase release segment-store on segment-store-backend
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which is now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
2022-09-16 12:16:57 +03:00
Botond Dénes
3717f7740d utils/logalloc: introduce segment_store_backend
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
on -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
2022-09-16 12:16:57 +03:00
Botond Dénes
5ea4d7fb39 utils/logalloc: push segment alloc/dealloc to segment_store
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
2022-09-16 12:16:57 +03:00
Botond Dénes
e82ea2f3ad test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Said test creates two vectors, the vector storage being allocated with
the default allocator, while its content being allocated on LSA. If an
exception is thrown however, both are freed via the default allocator,
triggering an assert in LSA code. Move the cleanup into a `defer()` so
the correct cleanup sequence is executed even on exceptions.
2022-09-16 12:16:57 +03:00
Pavel Emelyanov
e221bb0112 system_keyspace: De-static get_cdc_generation_id()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-16 08:34:15 +03:00
Pavel Emelyanov
fe48b66c0a cross-shard-barrier: Capture shared barrier in complete
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.

This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:

    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();

If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.

However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.

fixes: #11303

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11553
2022-09-16 08:21:02 +03:00
Michał Chojnowski
78850884d2 test: perf: perf_fast_forward: fix an error message
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.

Fix that by catching the exception and adding the helpful tip.

Closes #11533
2022-09-15 19:30:10 +02:00
Avi Kivity
d3b8c0c8a6 logalloc: don't crash while reporting reclaim stalls if --abort-on-seastar-bad-alloc is specified
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.

The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).

Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.

Fixes #11549

Closes #11555
2022-09-15 19:24:39 +02:00
Pavel Emelyanov
4f67898e7b system_keyspace: De-static cdc_is_rewritten()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:59 +03:00
Pavel Emelyanov
736021ee98 system_keyspace: De-static cdc_set_rewritten()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:53 +03:00
Pavel Emelyanov
b3d139bbdb system_keyspace: De-static update_cdc_generation_id()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:40 +03:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
David Garcia
3cc80da6af docs: update theme 1.3
Update conf.py

Closes #11330
2022-09-15 16:56:41 +03:00
Anna Stuchlik
e5c9f3c8a2 doc: fix the filename in the index to resolve the warnings and fix the link 2022-09-15 15:53:23 +02:00
Anna Stuchlik
338b45303a doc: apply feedback by adding she step fo load the new repo and fixing the links 2022-09-15 15:40:20 +02:00
Alejo Sanchez
92129f1d47 test.py: handle duplicate result from driver
Sometimes the driver calls twice the callback on ready done future with
a None result. Log it and avoid setting the local future twice.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:12:50 +02:00
Alejo Sanchez
2da7304696 test.py: log server restarts for topology tests
Add missing logging for server restart.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:10:29 +02:00
Alejo Sanchez
61a92afa2d test.py: log actions for topology tests
For debugging, log driver connection, before and after checks, and
topology changes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:10:29 +02:00
Botond Dénes
05ef13a627 Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho
Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted.

The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading.

The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.

The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold.

What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned.

Closes #11233

* github.com:scylladb/scylladb:
  test: Add test for large partition splitting on compaction
  compaction: Add support to split large partitions
  sstable: Extend sstable_run to allow disjointness on the clustering level
  sstables: simplify will_introduce_overlapping()
  test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
  test: lib: Fix inefficient merging of mutations in make_sstable_containing()
  sstables: Keep track of first partition's first pos and last partition's last pos
  sstables: Rename min/max position_range to a descriptive name
  sstables_manager: Add sstable metadata reader concurrency semaphore
  sstables: Add ability to find first or last position in a partition
2022-09-15 16:08:56 +03:00
Alejo Sanchez
604f7353ef Revert "test.py: restart stopped servers before...
teardown..."

This reverts commit df1ca57fda.

In order to prevent timeouts on teardown queries, the previous commit
added functionality to restart servers that were down. This issue is
fixed in fc0263fc9b so there's no longer need to restart stopped servers
on test teardown.
2022-09-15 14:47:01 +02:00
Alejo Sanchez
ed81f1a85c test.py: ManagerClient API fix return text
For ManagerClient request API, don't return status, raise an exception.
Server side errors are signaled by status 500, not text body.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Alejo Sanchez
4a5f2418ec test.py: ManagerClient raise on HTTP != 200
Raise an exception if the request result is not HTTP 200 for .get()
helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Alejo Sanchez
a84bde38c0 test.py: ManagerClient fix paths to updated resource
Fix missing path renames for server-side rename
"node" -> "server" API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Kamil Braun
728161003a Merge 'raft server, abort on background errors' from Gusev Petr
Halted background fibers render raft server effectively unusable, so
report this explicitly to the clients.

Fix: #11352

Closes #11370

* github.com:scylladb/scylladb:
  raft server, status metric
  raft server, abort group0 server on background errors
  raft server, provide a callback to handle background errors
  raft server, check aborted state on public server public api's
2022-09-15 14:12:11 +02:00
Alejo Sanchez
b8f68729b0 test.py: Pool add fresh when item not returned
Pool.get() might have waiting callers, so if an item is not returned
to the pool after use, tell the pool to add a new one and tell the pool
an entry was taken (used for total running entries, i.e. clusters).

Use it when a ScyllaCluster is dirty and not returned.

While there improve logging and docstrings.

Issue reported by @kbr-.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11546
2022-09-15 13:56:44 +03:00
Pavel Emelyanov
82162be1f1 messaging_service: Remove init/uninit helpers
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without it m.-s. start/stop
just looks like any other service out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11535
2022-09-15 11:54:46 +03:00
Raphael S. Carvalho
0a8afe18ca cql: Reject create and alter table with DateTieredCompactionStrategy
It's been ~1 year (2bf47c902e) since we set restrict_dtcs
config option to WARN, meaning users have been warned about the
deprecation process of DTCS.

Let's set the config to TRUE, meaning that create and alter statements
specifying DTCS will be rejected at the CQL level.

Existing tables will still be supported. But the next step will
be about throwing DTCS code into the shadow realm, and after that,
Scylla will automatically fallback to STCS (or ICS) for users which
ignored the deprecation process.

Refs #8914.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11458
2022-09-15 11:46:18 +03:00
Alejo Sanchez
7e3389ee43 test.py: schema timeout less than request timeout
When a server is down, the driver expects multiple schema timeouts
within the same request to handle it properly.

Found by @kbr-

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11544
2022-09-15 11:43:52 +03:00
Raphael S. Carvalho
a04047f390 compaction: Properly handle stop request for off-strategy
If user stops off-strategy via API, compaction manager can decide
to give up on it completely, so data will sit unreshaped in
maintenance set, preventing it from being compacted with data
in the main set. That's problematic because it will probably lead
to a significant increase in read and space amplification until
off-strategy is triggered again, which cannot happen anytime
soon.

Let's handle it by moving data in maintenance set into main one,
even if unreshaped. Then regular compaction will be able to
continue from where off-strategy left off.

Fixes #11543.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11545
2022-09-15 09:21:22 +03:00
Nadav Har'El
33e6a88d9a alternator ttl: comment fixes
This patch fixes a few errors and out-of-date descriptions in comments
in alternator/ttl.cc. No functional changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-15 00:03:43 +03:00
Nadav Har'El
8af9437508 docs/alternator: fix mention of old alternator-test directory
The directory that used to be called alternator-test is now (and has
been for a long time) really test/alternator. So let's fix the
references to it in docs/alternator/alternator.md.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-15 00:03:43 +03:00
Pavel Emelyanov
2c74062962 messaging_service: Fix gossiper verb group
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.

fixes: #11465

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:40:47 +03:00
Pavel Emelyanov
7bdad47de2 messaging_service: Mind the absence of topology data when creating sockets
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could providing more relevant configuration
(e.g. -- w/o encryption)

This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so thay they reconnect later.

The mentioned "kick" comes from storage service on-join notification.
More correct fix would be if topology had on-change notification and
messaging service subscribed on it, but there are two cons:

- currently dc/rack do not change on the fly (though they can, e.g. if
  gossiping property file snitch is updated without restart) and
  topology update effectively comes from a single place

- updating topology on token-metadata is not like topology.update()
  call. Instead, a clone of token metadata is created, then update
  happens on the clone, then the clone is committed into t.m. Though
  it's possible to find out commit-time which nodes changed their
  topology, but since it only happens on join this complexity likely
  doesn't worth the effort (yet)

fixes: #11514
fixes: #11492
fixes: #11483

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:51 +03:00
Pavel Emelyanov
5ffc9d66ec messaging_service: Templatize and rename remove_rpc_client_one
It actually finds and removes a client and in its new form it also
applies filtering function it, so some better name is called for

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:07 +03:00
Raphael S. Carvalho
20a6483678 test: Add test for large partition splitting on compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:19 -03:00
Raphael S. Carvalho
e2ccafbe38 compaction: Add support to split large partitions
Adds support for splitting large partitions during compaction.

Large partitions introduce many problems, like memory overhead and
breaks incremental compaction promise. We want to split large
partitions across fixed-size fragments. We'll allow a partition
to exceed size limit by 10%, as we don't want to unnecessarily split
partitions that just crossed the limit boundary.

To avoid having to open a minimal of 2 fragments in a read, partition
tombstone will be replicated to every fragment storing the
partition.

The splitting isn't enabled by default, and can be used by
strategies that are run aware like ICS. LCS still cannot support
it as it's still using physical level metadata, not run id.

An incremental reader for sstable runs will follow soon.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:16 -03:00
Raphael S. Carvalho
4bc24acf81 sstable: Extend sstable_run to allow disjointness on the clustering level
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
574e656793 sstables: simplify will_introduce_overlapping()
An element S1 is completely ordered before S2, if S1's last key is
lower than S2's first key.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
13942ec947 test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
That's where it belongs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
e1560c6b7f test: lib: Fix inefficient merging of mutations in make_sstable_containing()
make_sstable_containing() was absurdly slow when merging thousands of
mutations belonging to the same key, as it was unnecessarily copying
the mutation for every merge, producing bad complexity.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
5937765009 sstables: Keep track of first partition's first pos and last partition's last pos
With first partition's first position and last partition's last
partition, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.

Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.

Fixes #10637.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
a4bbdfcc58 sstables: Rename min/max position_range to a descriptive name
The new descriptive name is important to make a distinction when
sstable stores position range for first and last rows instead
of min and max.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
e099a9bf3b sstables_manager: Add sstable metadata reader concurrency semaphore
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
9bcad9ffa8 sstables: Add ability to find first or last position in a partition
This new method allows sstable to load the first row of the first
partition and last row of last partition.

That's useful for incremental reading of sstable run which will
be split at clustering boundary.

To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:48 -03:00
Nadav Har'El
77467bcbcd Merge 'test/pylib: APIs to read and modify configuration from tests' from Kamil Braun
We introduce `server_get_config` to fetch the entire configuration dict
and `update_config` to update a value under the given key.

Closes #11493

* github.com:scylladb/scylladb:
  test/pylib: APIs to read and modify configuration from tests
  test/pylib: ScyllaServer: extract _write_config_file function
  test/pylib: ScyllaCluster: extend ActionReturn with dict data
  test/pylib: ManagerClient: introduce _put_json
  test/pylib: ManagerClient: replace `_request` with `_get`, `_get_text`
  test: pylib: store server configuration in `ScyllaServer`
2022-09-14 18:49:55 +03:00
Kefu Chai
2a74a0086f docs: fix typos
* s/udpates/updates/
* s/opetarional/operational/

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11541
2022-09-14 17:04:05 +03:00
Kamil Braun
73bf781e17 test/pylib: APIs to read and modify configuration from tests
We introduce `server_get_config` to fetch the entire configuration dict
and `update_config` to update a value under the given key.
2022-09-14 12:46:41 +02:00
Kamil Braun
1f550428a9 test/pylib: ScyllaServer: extract _write_config_file function
For refreshing the on-disk config file with the config stored in dict
form in the `self.config` field.
2022-09-14 12:46:41 +02:00
Kamil Braun
52e52e8503 test/pylib: ScyllaCluster: extend ActionReturn with dict data
For returning types more complex than text. Also specify a default empty
string value for the `msg` field for non-text return values.
2022-09-14 12:46:41 +02:00
Kamil Braun
c9348ae8ea test/pylib: ManagerClient: introduce _put_json
For sending PUT requests to the Manager (such as updating
configuration).
2022-09-14 12:46:41 +02:00
Kamil Braun
d81c722476 test/pylib: ManagerClient: replace _request with _get, _get_text
`_request` performed a GET request and extracted a text body out of the
response.

Split it into `_get`, which only performs the request, and `_get_text`,
which calls `_get` and extracts the body as text.

Also extract a `_resource_uri` function which will be used for other
request types.
2022-09-14 12:46:41 +02:00
Kamil Braun
9d39e14518 test: pylib: store server configuration in ScyllaServer
In following commits we will make this configuration accessible from
tests through the Manager (for fetching and updating).
2022-09-14 12:46:41 +02:00
Nadav Har'El
cf30432715 Merge 'test: add a topology suite with Raft disabled' from Kamil Braun
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.

The suite will be used to test the Raft upgrade procedure.

The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.

Closes #11487

* github.com:scylladb/scylladb:
  test.py: PythonTestSuite: sum default config params with user-provided ones
  test: add a topology suite with Raft disabled
  test: pylib: use Python dicts to manipulate `ScyllaServer` configuration
  test: pylib: store `config_options` in `ScyllaServer`
2022-09-14 13:37:44 +03:00
Pavel Emelyanov
43131976e9 updateable_value: Update comment about cross-shard copying
refs: #7316

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11538
2022-09-14 12:35:56 +02:00
Michał Chojnowski
9b6fc553b4 db: commitlog: don't print INFO logs on shutdown
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.

Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.

Fixes #11508

Closes #11536
2022-09-14 11:30:53 +03:00
Avi Kivity
a24a8fd595 Update seastar submodule
* seastar cbb0e888d8...601e0776c0 (1):
  > coroutine: explain and mitigate the lambda coroutine fiasco

Closes #11537
2022-09-13 22:37:29 +03:00
Petr Gusev
4ff0807cd0 raft server, status metric 2022-09-13 19:34:22 +04:00
Alejo Sanchez
6799e766ca test.py: topology increment timeouts even more
Due to slow debug machines timing out, bump up all timeouts
significantly.

The cause was ExecutionProfile request_timeout. Also set a high
heartbeat timeout and bump already set timeouts to be safe, too.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11516
2022-09-13 11:57:31 +02:00
Piotr Dulikowski
e69b44a60f exception: fix the error code used for rate_limit_exception
Per-partition rate limiting added a new error type which should be
returned when Scylla decides to reject an operation due to per-partition
rate limit being exceeded. The new error code requires drivers to
negotiate support for it, otherwise Scylla will report the error as
`Config_error`. The existing error code override logic works properly,
however due to a mistake Scylla will report the `Config_error` code even
if the driver correctly negotiated support for it.

This commit fixes the problem by specifying the correct error code in
`rate_limit_exception`'s constructor.

Tested manually with a modified version of the Rust driver which
negotiates support for the new error. Additionally, tested what happens
when the driver doesn't negotiate support (Scylla properly falls back to
`Config_error`).

Branches: 5.1
Fixes: #11517

Closes #11518
2022-09-13 11:46:15 +02:00
Nadav Har'El
8ece63c433 Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails'
This series introduces two configurable options when working with TWCS tables:

- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.

Refs: #6923
Fixes: #9029

Closes #11445

* github.com:scylladb/scylladb:
  tests: cql_query_test: add mixed tests for verifying TWCS guard rails
  tests: cql_query_test: add test for TWCS window size
  tests: cql_query_test: add test for TWCS tables with no TTL defined
  cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
  cql: add max window restriction for TimeWindowCompactionStrategy
  time_window_compaction_strategy: reject invalid window_sizes
  cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
2022-09-12 23:55:51 +03:00
Botond Dénes
045b053228 Update seastar submodule
* seastar 2b2f6c08...cbb0e888 (10):
  > memory: allow user to select allocator to be used at runtime
  > perftune.py: correct typos
  > Merge 'seastar-addr2line: support more flexible syslog-style backtraces' from Benny Halevy
  > Fix instruction count for start_measuring_time
  > build: s/c-ares::c-ares/c-ares::cares/
  > Merge 'shared_ptr_debug_helper: turn assert into on_internal_error_abort' from Benny Halevy
  > test: fix use after free in the loopback socket
  > doc/tutorial.md: fix docker command for starting hello-world_demo
  > httpd: add a ctor without addr parameter
  > dns: dns_resolver: sock_entry: move-construct tcp/udp entries in place

Closes #11526
2022-09-12 18:34:22 +03:00
Avi Kivity
62ac3432c9 Merge "Always notify dropped RPC connections" from Pavel E
"
This set makes messaging service notify connection drop listeners
when connection is dropped for _any_ reason and cleans things up
around it afterwards
"

* 'br-messaging-notify-connection-drop' of https://github.com/xemul/scylla:
  messaging_service: Relax connection drop on re-caching
  messaging_service: Simplify remove_rpc_client_one()
  messaging_service: Notify connection drop when connection is removed
2022-09-12 17:02:51 +03:00
Yaron Kaikov
27e326652b build_docker.sh:fix python2 dependency
Following the revert of b004da9d1b which solved https://github.com/scylladb/scylla-pkg/issues/3094

updating docker dependency to match `scylla-tools-java` requirements

Closes #11522
2022-09-12 13:33:06 +03:00
Kamil Braun
2fe3e67a47 gms: feature_service: don't distinguish between 'known' and 'supported' features
`feature_service` provided two sets of features: `known_feature_set` and
`supported_feature_set`. The purpose of both and the distinction between
them was unclear and undocumented.

The 'supported' features were gossiped by every node. Once a feature is
supported by every node in the cluster, it becomes 'enabled'. This means
that whatever piece of functionality is covered by the feature, it can
by used by the cluster from now on.

The 'known' set was used to perform feature checks on node start; if the
node saw that a feature is enabled in the cluster, but the node does not
'know' the feature, it would refuse to start. However, if the feature
was 'known', but wasn't 'supported', the node would not complain. This
means that we could in theory allow the following scenario:
1. all nodes support feature X.
2. X becomes enabled in the cluster.
3. the user changes the configuration of some node so feature X will
   become unsupported but still known.
4. The node restarts without error.

So now we have a feature X which is enabled in the cluster, but not
every node supports it. That does not make sense.

It is not clear whether it was accidental or purposeful that we used the
'known' set instead of the 'supported' set to perform the feature check.

What I think is clear, is that having two sets makes the entire thing
unnecessarily complicated and hard to think about.

Fortunately, at the base to which this patch is applied, the sets are
always the same. So we can easily get rid of one of them.

I decided that the name which should stay is 'supported', I think it's
more specific than 'known' and it matches the name of the corresponding
gossiper application state.

Closes #11512
2022-09-12 13:09:12 +03:00
Takuya ASADA
cd5320fe60 install.sh: add --without-systemd option
Since we fail to write files to $USER/.config on Jenkins jobs, we need
an option to skip installing systemd units.
Let's add --without-systemd to do that.

Also, to detect the option availability, we need to increment
relocatable package version.

See scylladb/scylla-dtest#2819

Closes #11345
2022-09-12 13:04:00 +03:00
Avi Kivity
521127a253 Update tools/jmx submodule
* tools/jmx 06f2735...88d9bdc (1):
  > install.sh: add --without-systemd option
2022-09-12 13:02:16 +03:00
Kamil Braun
ce7bb8b6d0 test.py: PythonTestSuite: sum default config params with user-provided ones
Previously, if the suite.yaml file provided
`extra_scylla_config_options` but didn't provide values for `authorizer`
or `authenticator` inside the config options, the harness wouldn't give
any defaults for these keys. It would only provide defaults for these
keys if suite.yaml didn't specify `extra_scylla_config_options` at all.

It makes sense to give the user the ability to provide extra options
while relying on harness defaults for `authenticator` and `authorizer`
if the user doesn't care about them.
2022-09-12 11:58:05 +02:00
Kamil Braun
1661fe9f37 test: add a topology suite with Raft disabled
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.

The suite will be used to test the Raft upgrade procedure.

The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
2022-09-12 11:58:05 +02:00
Kamil Braun
311806244d test: pylib: use Python dicts to manipulate ScyllaServer configuration
Previously we used a formattable string to represent the configuration;
values in the string were substituted by Python's formatting mechanism
and the resulting string was stored to obtain the config file.

This approach had some downsides, e.g. it required boilerplate work to
extend: to add a new config options, you would have to modify this
template string.

Instead we can represent the configuration as a Python dictionary. Dicts
are easy to manipulate, for example you can sum two dicts; if a key
appears in both, the second dict 'wins':
```
{1:1} | {1:2} == {1:2}
```

This makes the configuration easy to extend without having to write
boilerplate: if the user of `ScyllaServer` wants to add or override a
config option, they can simply add it to the `config_options` dict and
that's it - no need to modify any internal template strings in
`ScyllaServer` implementation like before. The `config_options` dict is
simply summed with the 'base' config dict of `ScyllaServer`
(`config_options` is the right summand so anything in there overrides
anything in the base dict).

An example of this extensibility is the `authenticator` and `authorizer`
options which no longer appear in `scylla_cluster.py` module after this
change, they only appear in the suite.yaml file.

Also, use "workdir" option instead of specifying data dir, commitlog
dir etc. separately.
2022-09-12 11:57:58 +02:00
Kamil Braun
fd19825eaa test: pylib: store config_options in ScyllaServer
Previously the code extracted `authenticator` and `authorizer` keys from
the config options and stored them.

Store the entire dict instead. The new code is easier to extend if we
want to make more options configurable.
2022-09-12 11:57:18 +02:00
Pavel Emelyanov
5663b3eda5 messaging_service: Relax connection drop on re-caching
When messaging_service::get_rpc_client() picks up cached socket and
notices error on it, it drops the connection and creates a new one. The
method used to drop the connection is the one that re-lookups the verb
index again, which is excessive. Tune this up while at it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-12 12:05:02 +03:00
Nadav Har'El
b0371b6bf8 test/alternator: insert test names into Scylla logs
The output of test/alternator/run ends in Scylla's full log file, where
it is hard to understand which log messages are related to which test.
In this patch, we add a log message (using the new /system/log REST API)
every time a test is started and ends.

The messages look like this:

   INFO  2022-08-29 18:07:15,926 [shard 0] api - /system/log:
   test/alternator: Starting test_ttl.py::test_describe_ttl_without_ttl
   ...
   INFO  2022-08-29 18:07:15,930 [shard 0] api - /system/log:
   test/alternator: Ended test_ttl.py::test_describe_ttl_without_ttl

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
a81310e23d rest api: add a new /system/log operation
Add a new REST API operation, taking a log level and a message, and
printing it into the Scylla log.

This can be useful when a test wants to mark certain positions in the
log (e.g., to see which other log messages we get between the two
positions). An alternative way to achieve this could have been for the
test to write directly into the log file - but an on-disk log file is
only one of the logging options that Scylla support, and the approach
in this patch allows to add log message regardless of how Scylla keeps
the logs.

In motivation of this feature is that in the following patch the
test/alternator framework will add log messages when starting and
ending tests, which can help debug test failures.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
b9792ffb06 alternator ttl: log warning if scan took too long.
Currently, we log at "info" level how much time remained at the end of
a full TTL scan until the next scanning period (we sleep for that time).
If the scan was slower than the period, we didn't print anything.

Let's print a warning in this case - it can be useful for debugging,
and also users should know when their desired scan period is not being
honored because the full scan is taking longer than the desired scan
period.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
e7e9adc519 alternator,ttl: allow sub-second TTL scanning period, for tests
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.

In this patch, we change the parameter to allow a  floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.

I verified that even if I change the period to 0.1, there is no noticable
slowdown to other Alternator tests, so 0.5 is definitely safe.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
746c4bd9eb test/alternator: skip fewer Alternator TTL tests
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.

The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).

However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.

So in this patch, we remove the "veryslow" mark from 6 tests of Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.

Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5 second
time futher in the following patch, but decreasing the TTL scanning
period even further.

This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being passed, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.

4 more tests remain "veryslow" (and won't run by default) because they
are take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.

Fixes #11374

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

x

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
297109f6ee test/alternator: test Alternator TTL metrics
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.

We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty.
so requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.

To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Botond Dénes
a0392bc1eb Merge 'doc: update the default SStable format' from Anna Stuchlik
The purpose of this PR is to update the information about the default SStable format.
It

Closes #11431

* github.com:scylladb/scylladb:
  doc: simplify the information about default formats in different versions
  doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format
  doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page
  doc: fix additional information regarding the ME format on the SStable 3.x page
  doc: add the ME format to the table
  add a comment to remove the information when the documentation is versioned (in 5.1)
  doc: replace Scylla with ScyllaDB
  doc: fix the formatting and language in the updated section
  doc: fix the default SStable format
2022-09-12 09:50:01 +03:00
Pavel Emelyanov
f3dfc9dbd4 system_keyspace: Don't load preferred IPs if not asked for
If snitch->prefer_local() is false, advertised (via gossiper)
INTERNAL_IPs are not suggested to messaging service to use. The same
should apply to boot-time when messaging service is loaded with those
IPs taken from the system.peers table.

fixes: #11353
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2172/

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220909144800.23122-1-xemul@scylladb.com>
2022-09-12 09:48:23 +03:00
Botond Dénes
9db940ff1b Merge "Make network_topology_strategy_test use topology" from Pavel Emelyanov
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with temporary topology object.

Some snitch traces are still left, but those are for token_metadata
internal which still call global snitch for DC/RACK.
"

* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
  network_topology_strategy_test: Use topology instead of snitch
  network_topology_strategy_test: Populate explicit topology
2022-09-12 09:40:17 +03:00
Avi Kivity
6c797587c7 dirty_memory_manager: region_group: remove sorting of subgroups
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.

Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. It also tracks the nested region_group
that contains the largest region in a binomial heap.

This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.

Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.

We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.

A few tests that check the removed functionality are deleted.

Closes #11515
2022-09-12 09:29:08 +03:00
Botond Dénes
0e2d6cfd61 Merge 'Introduce Compaction Groups' from Raphael "Raph" Carvalho
Compaction group can be defined as a set of files that can be compacted together. Today, all sstables belonging to a table in a given shard belong to the same group. So we can say there's one group per table per shard. As we want to eventually allow isolation of data that shouldn't be mixed, e.g. data from different vnodes, then we want to have more than one group per table per shard. That's why compaction groups is being introduced here.

Today, all memtables and sstables are stored in a single structure per table. After compaction groups, there will be memtables and sstables for each group in the table.

As we're taking an incremental approach, table still supports a single group. But work was done on preparing table for supporting multiple groups. Completing that work is actually the next step. Also, a procedure for deriving the group from token is introduced, but today it always return the single group owned by the table. Once multiple groups are supported, then that procedure should be implemented to map a token to a group.

No semantics was changed by this series.

Closes #11261

* github.com:scylladb/scylladb:
  replica: Move memtables to compaction_group
  replica: move compound SSTable set to compaction group
  replica: move maintenance SSTable set to compaction_group
  replica: move main SSTable set to compaction_group
  replica: Introduce compaction_group
  replica: convert table::stop() into coroutine
  compaction_manager: restore indentation
  compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
  test: sstable_compaction_test: Don't reference main sstable set directly
  test: sstable_utils: Set data size fields for fake SSTable
  test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
2022-09-12 09:28:44 +03:00
Botond Dénes
5374f0edbf Merge 'Task manager' from Aleksandra Martyniuk
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.

At first it will support repair and compaction tasks, and possibly more in the future.

Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.

Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.

Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.

User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish

To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.

Fixes: #9809

Closes #11216

* github.com:scylladb/scylladb:
  test: task manager api test
  task_manager: test api layer implementation
  task_manager: add test specific classes
  task_manager: test api layer
  task_manager: api layer implementation
  task_manager: api layer
  task_manager: keep task_manager reference in http_context
  start sharded task manager
  task_manager: create task manager object
2022-09-12 09:26:46 +03:00
Petr Gusev
1b5fa4088e raft server, abort group0 server on background errors 2022-09-12 10:16:43 +04:00
Petr Gusev
e92dc9c15b raft server, provide a callback to handle background errors
Fix: #11352
2022-09-12 10:16:43 +04:00
Petr Gusev
c57238d3d6 raft server, check aborted state on public server public api's
Fix: #11352
2022-09-12 10:16:40 +04:00
Felipe Mendes
6a3d8607b4 tests: cql_query_test: add mixed tests for verifying TWCS guard rails
This patch adds set of 10 cenarios that have been unveiled during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe-mode. The situations covered are:

- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
2022-09-11 17:57:14 -03:00
Felipe Mendes
a7a91e3216 tests: cql_query_test: add test for TWCS window size
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.

The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want.
2022-09-11 17:38:25 -03:00
Felipe Mendes
1c5d46877e tests: cql_query_test: add test for TWCS tables with no TTL defined
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.

The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
2022-09-11 16:55:46 -03:00
Felipe Mendes
7fec4fcaa6 cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth.

However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.

The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":

We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database ommitting the `USING TTL` keyword.
2022-09-11 16:50:42 -03:00
Felipe Mendes
a3356e866b cql: add max window restriction for TimeWindowCompactionStrategy
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation on where users of TWCS end up underestimating their window buckets when using TWCS. Unfortunately, scenarios on which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough.

Such configuration is known to only make harm to a workload: As more and more windows are created, the number of SSTables will grow in the same pace, and the situation will only get worse as the number of shards increase.

This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check.

Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.
2022-09-11 16:50:22 -03:00
Felipe Mendes
f1ffb501f0 time_window_compaction_strategy: reject invalid window_sizes
Scylla mistakenly allows an user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows.
Interestingly enough, a <= 0 window size should be considered an undefined behavior as either we would create a new window every 0 duration (?) or the table would behave as STCS, the reader is encouraged to figure out which one of these is true. :-)

Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes and we now match the behavior to the same as Cassandra's.

Refs: #2336
2022-09-11 16:40:03 -03:00
Raphael S. Carvalho
f5715d3f0b replica: Move memtables to compaction_group
Now memtables live in compaction_group. Also introduced function
that selects group based on token, but today table always return
the single group managed by it. Once multiple groups are supported,
then the function should interpret token content to select the
group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
f4579795e6 replica: move compound SSTable set to compaction group
The group is now responsible for providing the compound set.
table still has one compound set, which will span all groups for
the cases we want to ignore the group isolation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
6717d96684 replica: move maintenance SSTable set to compaction_group
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
ce8e5f354c replica: move main SSTable set to compaction_group
This commit is restricted to moving main set into compaction_group.
Next, we'll move maintenance set into it and finally the memtable.

A method is introduced to figure out which group a sstable belongs
to, but it's still unimplemented as table is still limited to
a single group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4871f1c97c replica: Introduce compaction_group
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.

A group can be thought of an isolated LSM tree where its memtable
and sstable files are isolated from other groups.

As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.

Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.

There will be 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
a6ecadf3de replica: convert table::stop() into coroutine
await_pending_ops() is today marked noexcept, so doesn't have to
be implemented with finally() semantics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
44913ebbd0 compaction_manager: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
888660fa44 compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
stop_ongoing_compactions() is made noexcept too as it's called from
remove() and we want to make the latter noexcept, to allow compaction
group to qualify its stop function as noexcept too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
65414e6756 test: sstable_compaction_test: Don't reference main sstable set directly
Preparatory change for main sstable set to be moved into compaction
group. After that, tests can no longer direct access the main
set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
dfa7273127 test: sstable_utils: Set data size fields for fake SSTable
So methods that look at data size and require it to be higher than 0
will work on fake SSTables created using set_values().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4fa8159a13 test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
column_family_test::add_sstable will soon be changed to run in a thread,
and it's not needed in this procedure, so let's remove its usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Jadw1
ba461aca8b cql-pytest: more neutral command in cql_test_connection fixture
I found 'use system` to not be neutral enough (e.g. in case of testing
describe statement). `BEGIN BATCH APPLY BATCH` sounds better.

Closes #11504
2022-09-11 18:49:06 +03:00
Nadav Har'El
d71098a3b8 Update tools/java submodule
* tools/java b7a0c5bd31...b004da9d1b (1):
  > Revert "dist/debian:add python3 as dependency"
2022-09-11 17:45:43 +03:00
Pavel Emelyanov
bbad3eac63 pylib: Cast port number config to int explicitly
Otherwise it crashes some python versions.

The cast was there before a2dd64f68f
explicitly dropped one while moving the code between files.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11511
2022-09-09 18:08:08 +02:00
Kamil Braun
be1ef9d2a7 gms: feature_service: remove the USES_RAFT feature
It was not and won't be used for anything.
Note that the feature was always disabled or masked so no node ever
announced it, thus it's safe to get rid of.

Closes #11505
2022-09-09 18:05:46 +02:00
Michał Chojnowski
47844689d8 token_metadata: make local_dc_filter a lambda, not a std::function
This std::function causes allocations, both on construction
and in other operations. This costs ~2200 instructions
for a DC-local query. Fix that.

Closes #11494
2022-09-09 18:05:46 +02:00
Michał Chojnowski
af7ace3926 utils: config_file: fix handling of workdir,W in the YAML file
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.

Fixes #7478
Fixes #9500
Fixes #11503

Closes #11506
2022-09-09 18:05:46 +02:00
Kamil Braun
dba595d347 Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.

The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.

In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.

In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.

Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing

Closes #11164

* github.com:scylladb/scylladb:
  raft: broadcast_tables: add broadcast_kv_store test
  raft: broadcast_tables: add returning query result
  raft: broadcast_tables: add execution of intermediate language
  raft: broadcast_tables: add compilation of cql to intermediate language
  raft: broadcast_tables: add definition of intermediate language
  db: system_keyspace: add broadcast_kv_store table
  db: config: add BROADCAST_TABLES feature flag
2022-09-09 18:05:37 +02:00
Aleksandra Martyniuk
55cd8fe3bf test: task manager api test
Test of a task manager api.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
ec86410094 task_manager: test api layer implementation
The implementation of a test api that helps testing task manager
api. It provides methods to simulate the operations that can happen
on modules and theirs task. Through the api user can: register
and unregister the test module and the tasks belonging to the module,
and finish the tasks with success or custom error.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
b1fa6e49af task_manager: add test specific classes
Add test_module and test_task classes inheriting from respectively
task_manager::module and task_manager::task::impl that serve
task manager testing.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
42f36db55b task_manager: test api layer
The test api that helps testing task manager api. It can be used
to simulate the operations that can happen on modules and theirs
task. Through the api user can: register and unregister the test
module and the tasks belonging to the module, and finish the tasks
with success or custom error.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
c9637705a6 task_manager: api layer implementation
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for or retrieve
a status of the given task.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
07043cee68 task_manager: api layer
The task manager api layer. It can be used to list the modules
registered in task_manager, list tasks belonging to the given
module, abort, wait for or retrieve a status of the given task.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
b87a0a74ab task_manager: keep task_manager reference in http_context
Keep a reference to sharded<task_manager> as a member
of http_context so it can be reached from rest api.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
9e68c8d445 start sharded task manager
Sharded task manager object is started in main.cc.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
2439e55974 task_manager: create task manager object
Implementation of a task manager that allows tracking
and managing asynchronous tasks.

The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different module can be overriden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.

Each module is represented by task_manager::module type and should
have an access to task manager through task_manager::module methods.
That allows to easily separate and collectively manage data
belonging to each module.
2022-09-09 14:29:28 +02:00
Pavel Emelyanov
24d68e1995 messaging_service: Simplify remove_rpc_client_one()
Make it void as after previous patch no code is interesed in this value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:50:41 +03:00
Pavel Emelyanov
ca92ed65e5 messaging_service: Notify connection drop when connection is removed
There are two methods to close an RPC socket in m.s. -- one that's
called on error path of messaging_service::send_... and the other one
that's called upon gossiper down/leave/cql-off notifications. The former
one notifies listeners about connection drop, the latter one doesn't.

The only listener is the storage-proxy which, in turn, kicks database to
release per-table cache hitrate data. Said that, when a node goes down
(or when an operator shuts down its transport) the hit-rate stats
regarding this node are leaked.

This patch moves notification so that any socket drop calls notification
and thus releases the hitrates.

fixes: #11497

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:47:38 +03:00
Kamil Braun
0efdc45d59 Merge 'test.py: remove top level conftest and improve logging' from Alecco
- To isolate the different pytest suites, remove the top level conftest
  and move needed contents to existing `test/pylib/cql_repl/conftest.py`
  and `test/topology/conftest.py`.
- Add logging to CQL and Python suites.
- Log driver version for CQL and topology tests.

Closes #11482

* github.com:scylladb/scylladb:
  test.py: enable log capture for Python suite
  test.py: log driver name/version for cql/topology
  test.py: remove top level conftest.py
2022-09-08 16:25:24 +02:00
Anna Stuchlik
54d6d8b8cc doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst 2022-09-08 15:38:11 +02:00
Anna Stuchlik
6ccc838740 doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file 2022-09-08 15:38:11 +02:00
Anna Stuchlik
22317f8085 doc: update the Enterprise guide to include the Enterprise-onlyimage file 2022-09-08 15:38:11 +02:00
Anna Stuchlik
593f987bb2 doc: update the image files 2022-09-08 15:38:10 +02:00
Anna Stuchlik
42224dd129 doc: split the upgrade-image file to separate files for Open Source and Enterprise 2022-09-08 15:38:10 +02:00
Anna Stuchlik
64a527e1d3 doc: clarify the alternative upgrade procedures for the ScyllaDB image 2022-09-08 15:38:10 +02:00
Anna Stuchlik
5136d7e6d7 doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z 2022-09-08 15:38:10 +02:00
Anna Stuchlik
f1ef6a181e doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z 2022-09-08 15:38:10 +02:00
Mikołaj Grzebieluch
eb610c45fe raft: broadcast_tables: add broadcast_kv_store test
Test queries scylla with following statements:
* SELECT value FROM system.broadcast_kv_store WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST IF value = CONST;
where CONST is string randomly chosen from small set of random strings
and half of conditional updates has condition with comparison to last
written value.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
803115d061 raft: broadcast_tables: add returning query result
Intermediate language added new layer of abstraction between cql
statement and quering mutations, thus this commit adds new layer of
abstraction between mutations and returning query result.

Result can't be directly returned from `group0_state_machine::apply`, so
we decided to hold query results in map inside `raft_group0_client`. It can
be safely read after `add_entry_unguarded`, because this method waits
for applying raft command. After translating result to `result_message`
or in case of exception, map entry is erased.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
db88525774 raft: broadcast_tables: add execution of intermediate language
Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`.
Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft
commands without `group0_guard`.
Queries on group0_kv_store are executed in `group_0_state_machine::apply`,
but for now don't return results. They don't use previous state id, so they will
block concurrent schema changes, but these changes won't block queries.

In this version snapshots are ignored.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
82df8a9905 raft: broadcast_tables: add compilation of cql to intermediate language
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.

If the query is executed on a shard other than 0 it's bounced to that shard.
2022-09-08 15:25:36 +02:00
Wojciech Mitros
7effd4c53a wasm: directly handle recycling of invalidated instance
An instance may be invalidated before we try to recycle it.
We perform this by setting its value to a nullopt.
This patch adds a check for it when calculating its size.
This behavior didn't cause issues before because the catch
clause below caught errors caused by calling value() on
a nullopt, even though it was intended for errors from
get_instance_size.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #11500
2022-09-08 15:39:28 +03:00
Geoffrey Beausire
f435276d2e Merge tokens for everywhere_topology
With EverywhereStrategy, we know that all tokens will be on the same node and the data is typically sparse like LocalStrategy.

Result of testing the feature:
Cluster: 2 DC, 2 nodes in each DC, 256 tokens per nodes, 14 shards per node

Before: 154 scanning operations
After: 14 scanning operations (~10x improvement)

On bigger cluster, it will probably be even more efficient.

Closes #11403
2022-09-08 15:33:23 +03:00
Mikołaj Grzebieluch
c541d5c363 raft: broadcast_tables: add definition of intermediate language
In broadcast tables, raft command contains a whole program to be executed.
Sending and parsing on each node entire CQL statement is inefficient,
thus we decided to compile it to an intermediate language which can be
easily serializable.

This patch adds a definition of such a language. For now, only the following
types of statements can be compiled:
* select value where key = CONST from system.broadcast_kv_store;
* update system.broadcast_kv_store set value = CONST where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST;
where CONST is string literal.
2022-09-08 14:03:51 +02:00
Michał Chojnowski
0c54e7c5c7 sstables: index_reader: remove a stray vsprintf call from the hot path
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.

Closes #11485
2022-09-08 14:29:23 +03:00
Michał Chojnowski
c61b901828 utils: logalloc: prefer memory::free_memory() to memory::stats().free_memory
The former is a shortcut that does not involve a copy of all stats.
This saves some instructions in the hot path.

Closes #11495
2022-09-08 14:12:20 +03:00
Botond Dénes
438aaf0b85 Merge 'Deglobalize repair history maps' from Benny Halevy
Change a8ad385ecd introduced
```
thread_local std::unordered_map<utils::UUID, seastar::lw_shared_ptr<repair_history_map>> repair_history_maps;
```

We're trying to avoid global scoped variables as much as we can so this should probably be embedded in some sharded service.

This series moves the thread-local `repair_history_maps` instances to `compaction_manager`
and passes a reference to the shard compaction_manager to functions that need it for compact_for_query
and compact_for_compaction.

Since some paths don't need it and don't have access to the compactio_manager,
the series introduced `utils::optional_reference<T>` that allows to pass nullopt.
In this case, `get_gc_before_for_key` behaves in `tombstone_gc_mode::repair` as if the table wasn't repaired and tombstones are not garbage-collected.

Fixes #11208

Closes #11366

* github.com:scylladb/scylladb:
  tombstone_gc: deglobalize repair_history_maps
  mutation_compactor: pass tombstone_gc_state to compact_mutation_state
  mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
  mutation_partition: compact_for_compaction: get tombstone_gc_state
  mutation_readers: pass tombstone_gc_state to compating_reader
  sstables: get_gc_before_*: get tombstone_gc_state from caller
  compaction: table_state: add virtual get_tombstone_gc_state method
  db: view: get_tombstone_gc_state from compaction_manager
  db: view: pass base table to view_update_builder
  repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager
  replica: database: get_tombstone_gc_state from compaction_manager
  compaction_manager: add tombstone_gc_state
  replica: table: add get_compaction_manager function
  tombstone_gc: introduce tombstone_gc_state
  repair_service: simplify update_repair_time error handling
  tombstone_gc: update_repair_time: get table_id rather than schema_ptr
  tombstone_gc: delete unused forward declaration
  database: do not drop_repair_history_map_for_table in detach_column_family
2022-09-08 14:08:38 +03:00
Botond Dénes
9d1cc5e616 Merge 'doc: update the OS support for versions 2022.1 and 2022.2' from Anna Stuchlik
The scope of this PR:
- Removing support for Ubuntu 16.04 and Debian 9.
- Adding support for Debian 11.

Closes #11461

* github.com:scylladb/scylladb:
  doc: remove support for Debian 9 from versions 2022.1 and 2022.2
  doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2
  doc: add support for Debian 11 to versions 2022.1 and 2022.2
2022-09-08 13:27:47 +03:00
Anna Stuchlik
0dee507c48 doc: fix the upgrade version in the upgrade guide for RHEL and CentOS
Closes #11477
2022-09-08 13:26:49 +03:00
Alejo Sanchez
4190c61dbf test.py: enable log capture for Python suite
Enable pytest log capture for Python suite. This will help debugging
issues in remote machines.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
c6a048827a test.py: log driver name/version for cql/topology
Log the python driver name and version to help debugging on third party
machines.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
a2dd64f68f test.py: remove top level conftest.py
Remove top level conftest so different suites have their own (as it was
before).

Move minimal functionality into existing test/pylib/cql_repl/conftest.py
so cql tests can run on their own.

Move param setting into test/topology/conftest.py.

Use uuid module for unique keyspace name for cql tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
d892d194fb test.py: remove spurious after test check
Before/after test checks are done per test case, there's no longer need
to check after pytest finishes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11489
2022-09-08 11:33:37 +02:00
Felipe Mendes
7ccf8ed221 cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
As check_restricted_table_properties() is invoked both within CREATE TABLE and ALTER TABLE CQL statements,
we currently have no way to determine whether the operation was either a CREATE or ALTER. In many situations,
it is important to be able to distinguish among both operations, such as - for example - whether a table already has
a particular property set or if we are defining it within the statement.

This patch simply adds a std::optional<schema_ptr> to check_restricted_table_properties() and updates its caller.

Whenever a CREATE TABLE statement is issued, the method is called as a std::nullopt, whereas if an ALTER TABLE is
issued instead, we call it with a schema_ptr.
2022-09-07 21:27:32 -03:00
Kamil Braun
ff4430d8ea test: topology: make imports friendlier for tools (such as mypy)
When importing from `pylib`, don't modify `sys.path` but use the fact
that both `test/` and `test/pylib/` directories contain an `__init__.py`
file, so `test.pylib` is a valid module if we start with `test/` as the
Python package root.

Both `pytest` and `mypy` (and I guess other tools) understand this
setup.

Also add an `__init__.py` to `test/topology/` so other modules under the
`test/` directory will be able to import stuff from `test/topology/`
(i.e. from `test.topology.X import Y`).

Closes #11467
2022-09-07 23:52:50 +03:00
Karol Baryła
1c2eef384d transport/server.cc: Return correct size of decompressed lz4 buffer
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.

Fixes #11476
2022-09-07 10:58:23 +03:00
Nadav Har'El
e5f6adf46c test/alternator: improve tests for DescribeTable for indexes
I created new issues for each missing field in DescribeTable's
response for GSIs and LSIs, so in this patch we edit the xfail
messages in the test to refer to these issues.

Additionally, we only had a test for these fields for GSIs, so this
patch also adds a similar test for LSIs. I turns out there is a
difference between the two tests -  the two fields IndexStatus and
ProvisionedThroughput are returned for GSIs, but not for LSIs.

Refs #7750
Refs #11466
Refs #11470
Refs #11471

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11473
2022-09-07 09:50:16 +02:00
Benny Halevy
e9cfe9e572 tombstone_gc: deglobalize repair_history_maps
Move the thread-local instances of the
per-table repair history maps into compaction_manager.

Fixes #11208

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
8b38893895 mutation_compactor: pass tombstone_gc_state to compact_mutation_state
Used in get_gc_before.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
d86810d22c mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
To be passed down to compact_mutation_state in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
0627667a06 mutation_partition: compact_for_compaction: get tombstone_gc_state
And pass down to `do_compact`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
7e4612d3aa mutation_readers: pass tombstone_gc_state to compating_reader
To be passed further done to `compact_mutation_state` in
a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:14 +03:00
Benny Halevy
572d534d0d sstables: get_gc_before_*: get tombstone_gc_state from caller
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
2cd3fc2f36 compaction: table_state: add virtual get_tombstone_gc_state method
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.

It is going to be used in the next patched to pass the gc state
from the compaction_strategy down to sstables and compaction.

table_state_for_test was modified to just keep a null
tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
6fb4b5555d db: view: get_tombstone_gc_state from compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
71ede6124a db: view: pass base table to view_update_builder
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:04:23 +03:00
Benny Halevy
6a11c410fd repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:04:16 +03:00
Benny Halevy
3b0147390b replica: database: get_tombstone_gc_state from compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
8b841e1207 compaction_manager: add tombstone_gc_state
Add a tombstone_gc_state member and methods to get it.

Currently the tombstone_gc_state is default constructed,
but a following patch will move the thread-local
repair history maps into the compaction_manager as a member
and then the _tombstone_gc_state member will be initialized
from that member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
1ce50439af replica: table: add get_compaction_manager function
so to let a view get the tombstone_gc_state via
the compaction_manager of the base table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
5dd15aa3c8 tombstone_gc: introduce tombstone_gc_state
and use it to access the repair history maps.

At this introductory patch, we use default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
b2b211568e repair_service: simplify update_repair_time error handling
There's no need for per-shard try/catch here.
Just catch exceptions from the overall sharded operation
to update_repair_time.

Also, update warning to indicate that only updating the repair history
time failed, not "Loading repair history".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
7d13811297 tombstone_gc: update_repair_time: get table_id rather than schema_ptr
The function doesn't need access to the whole schema.
The table_id is just enough to get by.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
442d43181c tombstone_gc: delete unused forward declaration
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
3d88fe9729 database: do not drop_repair_history_map_for_table in detach_column_family
drop_repair_history_map_for_table is called on each shard
when database::truncate is done, and the table is stopped.

dropping it before the table is stopped is too early.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Nadav Har'El
ee606a5d52 Merge 'doc: fix the CQL version in the Interfaces table' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613

This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by the ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).

To make sure the same CQL version is used across the docs, we should use the `|cql-version| `variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```

Closes #11320

* github.com:scylladb/scylladb:
  doc: add the Cassandra version on which the tools are based
  doc: fix the version number
  doc: update the Enterprise version where the ME format was introduced
  doc: add the ME format to the Cassandar Compatibility page
  doc: replace Scylla with ScyllaDB
  doc: rewrite the Interfaces table to the new format to include more information about CQL support
  doc: remove the CQL version from pages other than Cassandra compatibility
  doc: fix the CQL version in the Interfaces table
2022-09-06 19:02:44 +03:00
Asias He
792a91b1fa storage_service: Drop ignore dead nodes option for bootstrap and decommission in log
The ignore dead node options are not really supported at the moment.
Drop it in the log to reduce confusion.

Closes #11426
2022-09-06 18:21:21 +03:00
Anna Stuchlik
4c7aa5181e doc: remove support for Debian 9 from versions 2022.1 and 2022.2 2022-09-06 14:04:22 +02:00
Anna Stuchlik
dfc7203139 doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2 2022-09-06 14:01:35 +02:00
Anna Stuchlik
dd4979ffa8 doc: add support for Debian 11 to versions 2022.1 and 2022.2 2022-09-06 13:54:08 +02:00
Pavel Emelyanov
398e9f8593 network_topology_strategy_test: Use topology instead of snitch
Most of the test's cases use rack-inferring snitch driver and get
DC/RACK from it via the test_dc_rack() helper. The helper was introduced
in one of the previous sets to populate token metadata with some DC/RACK
as normal tokens manipulations required respective endpoint in topology.

This patch removes the usage of global snitch and replaces it with the
pre-populated topology. The pre-population is done in rack-inferring
snitch like manner, since token_metadata still uses global snitch and
the locations from snitch and this temporary topology should match.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:26:30 +03:00
Pavel Emelyanov
d8b2940cd8 network_topology_strategy_test: Populate explicit topology
There's a test case that makes its own snitch driver that generates
pre-claculated DC/RACK data for test endpoints. This patch replaces this
custom snitch driver with a standalone topology object.

Note: to get DC/RACK info from this topo the get_location() is used
since the get_rack()/get_datacenter() are still wrappers around global
snitch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:24:39 +03:00
Botond Dénes
b89b84ad3c compaction: scrub/abort: be more verbose
Currently abort-mode scrub exits with a message which basically says
"some problem was found", with no details on what problem it found. Add
a detailed error report on the found problem before aborting the scrub.

Closes #11418
2022-09-06 11:42:34 +03:00
Avi Kivity
3dc39474ec Merge 'tools/scylla-types: add tokenof and shardof actions' from Botond Dénes
`tokenof` calculates and prints the token of a partition-key.
`shardof` calculates the token and finds the owner shard of a partition-key. The number of shards has to be provided by the `--sharads` parameter. Ignore msb bits param can be tweaked with the `--ignore-msb-bits` parameter, which defaults to 12.

Examples:
```
$ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888

$ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
```

Closes #11436

* github.com:scylladb/scylladb:
  tools/scylla-types: add shardof action
  tools/scylla-types: pass variable_map to action handlers
  tools/scylla-types: add tokenof action
  tools/scylla-types: extract printing code into functions
2022-09-06 11:25:54 +03:00
Pavel Emelyanov
42c9f35374 topology: Mark compare_endpoints() arguments as const
Continuation to debfcc0e (snitch: Move sort_by_proximity() to topology).
The passed addresses are not modified by the helper. They are not yet
const because the method was copy-n-pasted from snitch where it wasn't
such.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220906074708.29574-1-xemul@scylladb.com>
2022-09-06 11:03:13 +03:00
Yaron Kaikov
4459cecfd6 Docs: fix wrong manifest file for enterprise releases
In
https://docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image.html,
manifest file location is pointing the wrong filename for enterprise

Fixing

Closes #11446
2022-09-06 06:28:16 +03:00
Avi Kivity
ae4b2ee583 locator: token_metadata: drop unused and dangerous accessors
The mutable get_datacenter_endpoints() and get_datacenter_racks() are
dangerous since they expose internal members without enforcing class
invariants. Fortunately they are unused, so delete them.

Closes #11454
2022-09-06 06:08:02 +03:00
Avi Kivity
3f8cb608c3 Merge "Move auxiliary topology sorters from snitch" from Pavel E
"
There are two helpers on snitch that manipulate lists of nodes taking their
dc/rack into account. This set moves these methods from snitch to topology
and storage proxy.
"

* 'br-snitch-move-proximity-sorters' of https://github.com/xemul/scylla:
  snitch: Move sort_by_proximity() to topology
  topology: Add "enable proximity sorting" bit
  code: Call sort_endpoints_by_proximity() via topology
  snitch, code: Remove get_sorted_list_by_proximity()
  snitch: Move is_worth_merging_for_range_query to proxy
2022-09-05 17:25:08 +03:00
Anna Stuchlik
b0ebf0902c doc: add the Cassandra version on which the tools are based 2022-09-05 14:45:15 +02:00
Pavel Emelyanov
debfcc0eff snitch: Move sort_by_proximity() to topology
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:17:04 +03:00
Pavel Emelyanov
41973c5bf7 topology: Add "enable proximity sorting" bit
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by main code when it sees the snitch is
"simple" one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:15:07 +03:00
Pavel Emelyanov
b6fdea9a79 code: Call sort_endpoints_by_proximity() via topology
The method is about to be moved from snitch to topology, this patch
prepares the rest of the code to use the latter to call it. The
topology's method just calls snitch, but it's going to change in the
next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:14:01 +03:00
Pavel Emelyanov
4184091f1c snitch, code: Remove get_sorted_list_by_proximity()
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other one creates a sorted copy of the passed
const list (in fact -- the passed reference is not const, but it's not
modified by the method). However, both callers of the latter anyway
create their own temporary list of address, so they don't really benefit
from snitch generating another copy.

So this patch leaves just one sorting method -- the in-place one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:11:37 +03:00
Pavel Emelyanov
642e50f3e3 snitch: Move is_worth_merging_for_range_query to proxy
Proxy is the only place that calls this method. Also the method name
suggests it's not something "generic", but rather an internal logic of
proxy's query processing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:10:46 +03:00
Anna Stuchlik
7294fce065 doc: simplify the information about default formats in different versions 2022-09-05 11:36:24 +02:00
Avi Kivity
3ef8d616f6 Merge 'Fix wrong commit on scylla_raid_setup: prevent mount failed for /var/lib/scylla(#11399)' from Takuya ASADA
On #11399, I mistakenly committed bug fix of first patch (40134ef) to second one (8835a34).
So the script will broken when 40134ef only, it's not looks good when we backport it to older version.
Let's revert commits and make them single commit.

Closes #11448

* github.com:scylladb/scylladb:
  scylla_raid_setup: prevent mount failed for /var/lib/scylla
  Revert "scylla_raid_setup: check uuid and device path are valid"
  Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"
2022-09-05 12:16:10 +03:00
Mikołaj Grzebieluch
726658f073 db: system_keyspace: add broadcast_kv_store table
First implementation of strongly consistent everywhere tables operates on simple table
representing string to string map.

Add hard-coded schema for broadcast_kv_store table (key text primary key,
value text). This table is under system keyspace and is created if and only if
BROADCAST_TABLES feature is enabled.
2022-09-05 11:11:08 +02:00
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Avi Kivity
e3cdc8c4d3 Update tools/java submodule (python3 dependency)
* tools/java 6995a83cc1...b7a0c5bd31 (1):
  > dist/debian:add python3 as dependency
2022-09-05 12:08:24 +03:00
Takuya ASADA
d676c22f09 scylla_raid_setup: prevent mount failed for /var/lib/scylla
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Also added code to check make sure uuid and uuid based device path are valid.

Fixes #11359

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2022-09-05 17:52:49 +09:00
Takuya ASADA
ede7da366b Revert "scylla_raid_setup: check uuid and device path are valid"
This reverts commit 40134efee4.
2022-09-05 17:52:42 +09:00
Takuya ASADA
841c686301 Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"
This reverts commit 8835a34ab6.
2022-09-05 17:52:41 +09:00
Piotr Sarna
2379a25ade alternator: propagate authenticated user in client state
From now on, when an alternator user correctly passed an authentication
step, their assigned client_state will have that information,
which also means proper access to service level configuration.
Previously the username was only used in tracing.
2022-09-05 10:43:29 +02:00
Anna Stuchlik
39e6002fc8 doc: fix the version number 2022-09-05 10:04:34 +02:00
Piotr Sarna
66f7ab666f client_state: add internal constructor with auth_service
The constructor can be used as backdoor from frontends other than
CQL to create a session with an authenticated user, with access
to its attached service level information.
2022-09-05 10:03:00 +02:00
Piotr Sarna
9511c21686 alternator: pass auth_service and sl_controller to server
It's going to be needed to recreate a client state for an authenticated
user.
2022-09-05 10:03:00 +02:00
Anna Stuchlik
81f32899d0 doc: update the Enterprise version where the ME format was introduced 2022-09-05 10:02:57 +02:00
Botond Dénes
f8b38cbe09 Merge 'doc: add support for Ubuntu 22.04 in ScyllaDB Enterprise' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11430

@tzach I've added support for Ubuntu 22.04 to the row for version 2022.2. Does that version support Debian 11? That information is also missing (it was only added to OSS 5.0 and 5.1).

Closes #11437

* github.com:scylladb/scylladb:
  doc: add support for Ubuntu 22.04 to the Enterprise table
  doc: rename the columns in the Enterpise section to be in sync with the OSS section
2022-09-05 06:42:55 +03:00
Anna Stuchlik
41b91e3632 doc: fix the architecture type on the upgrade page
Closes #11438
2022-09-05 06:30:51 +03:00
Botond Dénes
21ef0c64f1 tools/scylla-types: add shardof action
Decorates a partition key and calculates which shard it belongs to,
given the shard count (--shards) and the ignore msb bits
(--ignore-msb-bits) parameters. The latter is optional and is defaulted to
12.

Example:

    $ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
    (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
2022-09-05 06:22:57 +03:00
Botond Dénes
4333d33f01 tools/scylla-types: pass variable_map to action handlers
Allowing them to have get the value of extra command line parameters.
2022-09-05 06:22:55 +03:00
Botond Dénes
58d4f22679 tools/scylla-types: add tokenof action
Calculate and print the token of a partition-key.
Example:

    $ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
    (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888
2022-09-05 06:20:10 +03:00
Botond Dénes
be9d1c4df4 sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422
2022-09-04 20:02:50 +03:00
Botond Dénes
3e69fe0fe7 scylla-gdb.py: scylla repairs: print only address of repair_meta
Instead of the entire object. Repair meta is a large object, its
printout floods the output of the command. Print only its address, the
user can print the objects it is interested in.

Closes #11428
2022-09-04 19:58:42 +03:00
Yaron Kaikov
9f9ee8a812 build_docker.sh: Build docker based on Ubuntu:22.04
Ubuntu 20.04 has less than 3 years of OS support remaining.

We should switch to Ubuntu 22.04 to reduce the need for OS upgrades in newly installed clusters.

Closes #11440
2022-09-04 14:00:27 +03:00
Avi Kivity
61769d3b21 Merge "Make messaging service use topology for DC/RACK" from Pavel E
"
Messaging needs to know DC/RACK for nodes to decide whether it needs to
do encryption or compression depending on the options. As all the other
services did it still uses snitch to get it, but simple switch to use
topology needs extra care.

The thing is that messaging can use internal IP instead of endpoints.
Currently it's snitch who tries har^w somehow to resolve this, in
particular -- if the DC/RACK is not found for the given argument it
assumes that it might be internal IP and calls back messaging to convert
it to the endpoint. However, messaging does know when it uses which
address and can do this conversion itself.

So this set eliminates few more global snitch usages and drops the
knot tieing snitch, gossiper and messaging with each-other.
"

* 'br-messaging-use-topology-1.2' of https://github.com/xemul/scylla:
  messaging: Get DC/RACK from topology
  messaging, topology: Keep shared_token_metadata* on messaging
  messaging: Add is_same_{dc|rack} helpers
  snitch, messaging: Dont relookup dc/rack on internal IP
2022-09-04 13:54:34 +03:00
Pavel Emelyanov
6dedc69608 topology: Do not add bootstrapping nodes to topology
Recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry presense
in the embedded topology object. Respectively, the commit in question
equipped most callers of update_normal_tokens() with preceeding
topology update call to satisfy the requirement.

However, tokens are put into token_metadata not only for normal state,
but also for bootstrapping, and one place that added bootstrapping
tokens errorneously got topology update. This is wrong -- node must
not be present in the topology until switching into normal state. As
the result several tests with bootstrapping nodes started to fail.

The fix removes topology update for bootstrapping nodes, but this
change reveals few other places that piggy-backed this mistaken
update, so noy _they_ need to update topology themselves.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
       update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
       update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
       repair_based_node_operations_test.py::test_lcs_reshape_efficiency

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
2022-09-04 13:53:38 +03:00
Avi Kivity
16a3e55aa1 Update seastar submodule
* seastar f2d70c4a17...2b2f6c080e (4):
  > perftune.py: special case a former 'MQ' mode in the new auto-detection code
  > iostream: Generalize flush and batched flush
  > Merge "Equip sharded<>::invoke_on_all with unwrap_sharded_args" from Pavel E
  > Merge "perftune.py: cosmetic fixes" from VladZ

Closes #11434
2022-09-04 10:19:48 +03:00
Anna Stuchlik
5d09e1a912 doc: add the ME format to the Cassandar Compatibility page 2022-09-02 15:12:30 +02:00
Anna Stuchlik
dfb7a221db doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format 2022-09-02 14:55:11 +02:00
Anna Stuchlik
f1184e1470 doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page 2022-09-02 14:48:58 +02:00
Anna Stuchlik
177a1d4396 doc: fix additional information regarding the ME format on the SStable 3.x page 2022-09-02 14:41:55 +02:00
Anna Stuchlik
b3eacdca75 doc: add the ME format to the table 2022-09-02 14:26:16 +02:00
Anna Stuchlik
af4d1b80d8 doc: add support for Ubuntu 22.04 to the Enterprise table 2022-09-02 12:43:04 +02:00
Anna Stuchlik
947f8769f4 doc: rename the columns in the Enterpise section to be in sync with the OSS section 2022-09-02 12:31:57 +02:00
Pavel Emelyanov
f0580aedaf messaging: Get DC/RACK from topology
Now everything is prepared for that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-02 11:34:57 +03:00
Botond Dénes
be70fcf587 tools/scylla-types: extract printing code into functions
To make the individual overloads on the exact type usable on their own.
2022-09-02 07:46:18 +03:00
Botond Dénes
2c46c24608 Merge 'doc: change the tool names to "Scylla SStable" and "Scylla Types"' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11393

- Rename the tool names across the docs.
- Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively.

Closes #11432

* github.com:scylladb/scylladb:
  doc: update the tool names in the toctree and reference pages
  doc: rename the scylla-types tool as Scylla Types
  doc: rename the scylla-sstable tool as Scylla SStable
2022-09-01 16:32:18 +03:00
Anna Stuchlik
18da200669 doc: update the tool names in the toctree and reference pages 2022-09-01 15:09:12 +02:00
Anna Stuchlik
c255399f27 doc: rename the scylla-types tool as Scylla Types 2022-09-01 15:05:44 +02:00
Anna Stuchlik
d0cb24feaa doc: rename the scylla-sstable tool as Scylla SStable 2022-09-01 14:45:19 +02:00
Anna Stuchlik
1834d5d121 add a comment to remove the information when the documentation is versioned (in 5.1) 2022-09-01 12:57:15 +02:00
Anna Stuchlik
476107912c doc: replace Scylla with ScyllaDB 2022-09-01 12:52:58 +02:00
Anna Stuchlik
8aae8a3cef doc: fix the formatting and language in the updated section 2022-09-01 12:50:04 +02:00
Anna Stuchlik
ff4ae879cb doc: fix the default SStable format 2022-09-01 12:47:11 +02:00
Pavel Emelyanov
e147681d85 messaging, topology: Keep shared_token_metadata* on messaging
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.

However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
551c51b5bf messaging: Add is_same_{dc|rack} helpers
For convenience of future patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
c08c370c2c snitch, messaging: Dont relookup dc/rack on internal IP
When getting dc/rack snitch may perform two lookups -- first time it
does it using the provided IP, if nothing is found snitch assumes that
the IP is internal one, gets the corresponding public one and searches
again.

The thing is that the only code that may come to snitch with internal
IP is the messaging service. It does so in two places: when it tries
to connect to the given endpoing and when it accepts a connection.

In the former case messaging performs public->internal IP conversion
itself and goes to snitch with the internal IP value. This place can get
simpler by just feeding the public IP to snich, and converting it to the
internal only to initiate the connection.

In the latter case the accepted IP can be either, but messaging service
has the public<->private map onboard and can do the conversion itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Avi Kivity
fe401f14de Merge 'scylla_raid_setup: prevent mount failed for /var/lib/scylla' from Takuya ASADA
Just like 4a8ed4cc6f, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Also added code to check make sure uuid and uuid based device path are valid.

Fixes #11359

Closes #11399

* github.com:scylladb/scylladb:
  scylla_raid_setup: prevent mount failed for /var/lib/scylla
  scylla_raid_setup: check uuid and device path are valid
2022-08-31 19:59:38 +03:00
Kefu Chai
a5e696fab8 storage_service, test: drop unused storage_service_config
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.

in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11415
2022-08-31 19:49:13 +03:00
Botond Dénes
2a3012db7f docs/README.md: expand prerequisites list
poetry and make was missing from the list.

Closes #11391
2022-08-31 17:00:59 +03:00
Botond Dénes
cb98d4f5da docs: admin-tools: remove defunct sstable-index
Said tool was supplanted by scylla-sstable in 4.6. Remove the page as
well as all references to it.

Closes #11392
2022-08-31 17:00:04 +03:00
Avi Kivity
a9a230afbe scripts: introduce script to apply email, working around google groups brokeness
Google Groups recently started rewriting the From: header, garbaging
our git log. This script rewrites it back, using the Reply-To header
as a still working source.

Closes #11416
2022-08-31 14:47:24 +03:00
Botond Dénes
b9fc504fb2 Merge 'doc: cql-extensions.md: improve description of synchronous views' from Nadav Har'El
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.

This patch changes the description to (I hope) better address these
issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11404

* github.com:scylladb/scylladb:
  doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
  doc: cql-extensions.md: improve description of synchronous views
2022-08-31 14:33:39 +03:00
Avi Kivity
2ab5cbd841 Merge 'Docs: document how scylla-sstable obtains its schema' from Botond Dénes
This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump.
Fixes: https://github.com/scylladb/scylladb/issues/11363

Closes #11390

* github.com:scylladb/scylladb:
  docs: scylla-sstable.rst: add comparison with SStableDump
  docs: scylla-sstable.rst: add section about providing the schema
2022-08-31 14:28:52 +03:00
Anna Stuchlik
72b77b8c78 doc: add a comment to remove the note in version 5.1 2022-08-31 12:49:10 +02:00
Anna Stuchlik
b4bbd1fd53 doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB 2022-08-31 12:39:05 +02:00
Nadav Har'El
ad0f6158c4 doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
It was recently decided that the database should be referred to as
"ScyllaDB", not "Scylla". This patch changes existing references
in docs/cql/cql-extensions.md.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-31 13:23:24 +03:00
Nadav Har'El
ec8e98e403 doc: cql-extensions.md: improve description of synchronous views
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.

This patch changes the description to (I hope) better address these
issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-31 13:22:24 +03:00
Anna Stuchlik
180fc73695 doc: add a note to the description of COUNT with a reference to the KB article 2022-08-31 12:11:12 +02:00
Anna Stuchlik
cff849d845 doc: add COUNT to the list of acceptable selectors of the SELECT statement 2022-08-31 11:59:20 +02:00
Avi Kivity
421557b40a Merge "Provide DC/RACK when populating topology" from Pavel E
"
The topology object maintains all sort of node/DC/RACK mappings on
board. When new entries are added to it the DC and RACK are taken
from the global snitch instance which, in turn, checks gossiper,
system keyspace and its local caches.

This set make topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service that knows exactly where to get those from.

After this set it will be possible to remove the dependency knot
consiting of snitch, gossiper, system keyspace and messaging.
"

* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
  toplogy: Use the provided dc/rack info
  test: Provide testing dc/rack infos
  storage_service: Provide dc/rack for snitch reconfiguration
  storage_service: Provide dc/rack from system ks on start
  storage_service: Provide dc/rack from gossiper for replacement
  storage_service: Provide dc/rack from gossiper for remotes
  storage_service,dht,repair: Provide local dc/rack from system ks
  system_keyspace: Cache local dc-rack on .start()
  topology: Some renames after previous patch
  topology: Require entry in the map for update_normal_tokens()
  topology: Make update_endpoint() accept dc-rack info
  replication_strategy: Accept dc-rack as get_pending_address_ranges argument
  dht: Carry dc-rack over boot_strapper and range_streamer
  storage_service: Make replacement info a real struct
2022-08-31 12:53:06 +03:00
Benny Halevy
c284c32f74 Update seastar submodule
* seastar f9f5228b74...f2d70c4a17 (51):

  >  cmake: attach property to Valgrind not to hwloc
  >  Create the seastar_memory logger in all builds
  >  drop unused parameters
  >  Merge "Unify pollable_fd shutdown and abort_{reader|writer}" from Pavel E
  > >  pollable_fd: Replace two booleans with a mask
  > >  pollable_fd: Remove abort_reader/_writer
  >  Merge "Improve Rx channels assignment" from Vlad
  > >  perftune.py: fix comments of IRQ ordering functors
  > >  perftune.py: add VIRTIO fast path IRQs ordering functor
  > >  perftune.py: reduce number of Rx channels to the number of IRQ CPUs
  > >  perftune.py: introduce a --num-rx-queues parameter
  >  program_options: enable optional selection_value
  >  .gitignore: ignore the directories generated by VS Code and CLion.
  >  httpd: compare the Connection header value in a case-insensitive manner.
  >  httpd: move the logic of keepalive to a separate method.
  >  register one default priority class for queue
  >  Reset _total_stats before each run
  >  log: add colored logging support
  >  Merge "perftune.py: add NUMA aware auto-detection for big machines" from Vlad
  > >  perftune.py: mention 'irq_cpu_mask' in the description of the script operation
  > >  perftune.py: NetPerfTuner: fix bits counting in self.irqs_cpu_mask wider than 32 bits
  > >  perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): cosmetics: fix a comment and a variable name
  > >  perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): take into account omitted zero components of the mask
  > >  perftune.py: PerfTuneBase.compute_cpu_mask_for_mode(): cosmetics: fix a variable name
  > >  perftune.py: stop printing 'mode' in --dump-options-file
  > >  perftune.py: introduce a generic auto_detect_irq_mask(cpu_mask) function
  > >  perftune.py: DiskPerfTuner: use self.irqs_cpu_mask for tuning non-NVME disks
  > >  perftune.py: stop auto-detecting and using 'mode' internally
  > >  perftune.py: introduce --get-irq-cpu-mask command line parameter
  > >  perftune.py: introduce --irq-core-auto-detection-ratio parameter
  >  build: add a space after function name
  >  Update HACKING.md
  >  log: do not inherit formatter<seastar::log_level> from formatter<string_view>
  >  Merge "Mark connected_socket::shutdown_...'s internals noexcept" from Pavel E
  > > native-stack: Mark tcp::in_state() (and its wrappers) const noexcept
  > > native-stack: Mark tcb::close and tcb::abort_reader noexcept
  > > native-stack: Mark tcp::connection::close_{read|write}() noexcept
  > > native-stack: Mark tcb::clear_delayed_ack() and tcb::stop_retransmit_timer() noexcept
  > > tls: Mark session::close() noexcept
  > > file_desc: Add fdinfo() helper
  > > posix-stack: Mark posix_connected_socket_impl::shutdown_{input|output}() noexcept
  > > tests: Mark loopback_buffer::shutdown() noexcept
  >  Merge "Enhance RPC connection error injector" from Pavel E
  > >  loopback_socket: Shuffle error injection
  > >  loopback_socket: Extend error injection
  > >  loopback_socket: Add one-shot errors
  > >  loopback_socket: Add connection error injection
  > >  rpc_test: Extend error injector with kind
  > >  rpc_test: Inject errors on all paths
  > >  rpc_test: Use injected connect error
  > >  rpc_test: De-duplicate test socket creation
  >  Merge 'tls: vec_push: handle async errors rather than throwing on_internal_error' from Benny Halevy
  > >  tls: do_handshake: handle_output_error of gnutls_handshake
  > >  tls: session: vec_push: return output_pending error
  > >  tls: session: vec_push: reindent
  >  log: disambiguate formatter<log_level> from operator<<
  >  tls_test: Fix spurious fail in test_x509_client_with_builder_system_trust_multiple (et al)

Fixes scylladb/scylladb#11252

Closes #11401
2022-08-31 12:12:48 +03:00
Botond Dénes
dca351c2a6 Merge 'doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1' from Anna Stuchlik
This PR is related to  https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123.

**New Enterprise Upgrade Guide from 2021.1 to 2022.2**
I've added the upgrade guide for ScyllaDB Enterprise image. In consists of 3 files:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst
upgrade/_common/upgrade-image.rst
 /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst

**Modified Enterprise Upgrade Guides 2021.1 to 2022.2**
I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst

To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide.

**Modified Enterprise Upgrade Guides from 4.6 to 5.0**
These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123).

I've modified the index file to be in sync with the updates.

Closes #11285

* github.com:scylladb/scylladb:
  doc: reorganize the content to list the recommended way of upgrading the image first
  doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file
  doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information
  doc: update the guides for Ubuntu and Debian to remove image information and the OS version number
  doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1
2022-08-31 07:24:55 +03:00
Gleb Natapov' via ScyllaDB development
0d20830863 direct_failure_detector: reduce severity of ping error logging
Having an error while pinging a peer is not a critical error. The code
retires and move on. Lets log the message with less severity since
sometimes those error may happen (for instance during node replace
operation some nodes refuse to answer to pings) and dtest complains that
there are unexpected errors in the logs.

Message-Id: <Ywy5e+8XVwt492Nc@scylladb.com>
2022-08-31 07:11:59 +03:00
Raphael S. Carvalho
631b2d8bdb replica: rename table::on_compaction_completion and coroutinize it
on_compaction_completion() is not very descriptive. let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().

Also let's coroutinize it, so to remove the restriction of running
it inside a thread only.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11407
2022-08-31 06:17:20 +03:00
Nadav Har'El
a797512148 Merge 'Raft test topology start stopped servers' from Alecco
Test teardown involves dropping the test keyspace. If there are stopped servers occasionally we would see timeouts.

Start stopped servers after a test is finished (and passed).

Revert previous commit making teardown async again.

Closes #11412

* github.com:scylladb/scylladb:
  test.py: restart stopped servers before teardown...
  Revert "test.py: random tables make DDL queries async"
2022-08-30 22:48:47 +03:00
Pavel Emelyanov
e5e75ba43c Merge 'scylla-gdb.py: bring scylla reads-stats up-to-date' from Botond Dénes
Said command is broken since 4.6, as the type of `reader_concurrency_semaphore::_permit_list` was changed without an accompanying update to this command. This series updates said command and adds it to the list of tested commands so we notice if it breaks in the future.

Closes #11389

* github.com:scylladb/scylladb:
  test/scylla-gdb: test scylla read-stats
  scylla-gdb.py: read_stats: update w.r.t. post 4.5 code
  scylla-gdb.py: improve string_view_printer implementation
2022-08-30 20:24:02 +03:00
Nadav Har'El
56d714b512 Merge 'Docs: Update support OS' from Tzach Livyatan
This PR change the CentOS 8 support to Rocky, and add 5.1 and 2022.1, 2022.2 rows to the list of Scylla releases

Closes #11383

* github.com:scylladb/scylladb:
  OS support page: use CentOS not Centos
  OS support page: add 5.1, 2022.1 and 2022.2
  OS support page: Update CentOS 8 to Rocky 8
2022-08-30 18:02:44 +03:00
Anna Stuchlik
0d3285dd3c doc: replace Scylla with ScyllaDB 2022-08-30 15:35:06 +02:00
Anna Stuchlik
ab04ed2fda doc: rewrite the Interfaces table to the new format to include more information about CQL support 2022-08-30 15:31:41 +02:00
Anna Stuchlik
4ac5574f1d doc: remove the CQL version from pages other than Cassandra compatibility 2022-08-30 13:58:26 +02:00
Alejo Sanchez
df1ca57fda test.py: restart stopped servers before teardown...
for topology tests

Test teardown involves dropping the test keyspace. If there are stopped
servers occasionally we would see timeouts.

Start stopped servers after a test is finished.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-30 11:40:40 +02:00
Alejo Sanchez
e5eac22a37 Revert "test.py: random tables make DDL queries async"
This reverts commit 67c91e8bcd.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-30 10:54:33 +02:00
Anna Stuchlik
99cee1aceb doc: reorganize the content to list the recommended way of upgrading the image first 2022-08-30 10:11:02 +02:00
Anna Stuchlik
ffe6f97c06 doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file 2022-08-30 10:01:56 +02:00
Tzach Livyatan
4e413787d2 doc: Fix nodetool flush example
`nodetool flush` have a space between *keyspace* and *table* names

See
https://docs.scylladb.com/stable/operating-scylla/nodetool-commands/flush
for the right syntax.

Fixes #11314

Closes #11334
2022-08-29 15:06:38 +03:00
Nadav Har'El
eed65dfc2d Merge 'db: schema_tables: Make table creation shadow earlier concurrent changes' from Tomasz Grabiec
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:

scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.

If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.

The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.

Fixes #11396

Closes #11398

* github.com:scylladb/scylladb:
  db: schema_tables: Make table creation shadow earlier concurrent changes
  db: schema_tables: Fix formatting
  db: schema_mutations: Make operator<<() print all mutations
  schema_mutations: Make it a monoid by defining appropriate += operator
2022-08-29 14:21:07 +03:00
Tomasz Grabiec
ae8d2a550d db: schema_tables: Make table creation shadow earlier concurrent changes
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:

scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.

If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.

The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.

Fixes #11396
2022-08-29 12:06:02 +02:00
Benny Halevy
d588e2a7c5 release: properly evaluate SCYLLA_BUILD_MODE_* macros
Patch 765d2f5e46 did not
evaluate the #if SCYLLA_BUILD_MODE directives properly
and it always matched SCYLLA_BULD_MODE == release.

This change fixes that by defining numerical codes
for each build mode and using macro expansion to match
the define SCYLLA_BUILD_MODE against these codes.

Also, ./configure.py was changes to pass SCYLLA_BUILD_MODE
to all .cc source files, and makes sure it is defined
in build_mode.hh.

Support was added for coverage build mode,
and an #error was added if SCYLLA_BUILD_MODE
was not recognized by the #if ladder directives.

Additional checks verifying the expected SEASTAR_DEBUG
against SCYLLA_BUILD_MODE were added as well,

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11387
2022-08-29 10:20:19 +03:00
Botond Dénes
0f4666010a docs: scylla-sstable.rst: add comparison with SStableDump
The two tools have very similar goals, user might wonder when to use one
or the other.
Also add a link to sstabledump.rst to scylla-sstable.
2022-08-29 08:29:14 +03:00
Botond Dénes
65da6a26a3 docs: scylla-sstable.rst: add section about providing the schema
Providing the schema for the scylla-sstable tool is an important topic
that was completely missing from the description so far.
2022-08-29 08:29:09 +03:00
Alejo Sanchez
67c91e8bcd test.py: random tables make DDL queries async
There are async timeouts for ALTER queries. Seems related to othe issues
with the driver and async.

Make these queries synchronous for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11394
2022-08-28 10:38:39 +03:00
Felipe Mendes
fd5cb85a7a alternator - Doc - Update DescribeTable response and introduce hashing function differences
This commit introduces the following changes to Alternator compability doc:

* As of https://github.com/scylladb/scylladb/pull/11298 Alternator will return ProvisionedThroughput in DescribeTable API calls. We add the fact that tables will default to a BillingMode of PAY_PER_REQUEST (this wasn't made explicit anywhere in the docs), and that the values for RCUs/WCUs are hardcoded to 0.
* Mention the fact that ScyllaDB (thus Alternator) hashing function is different than AWS proprietary implementation for DynamoDB. This is mostly of an implementation aspect rather than a bug, but it may cause user confusion when/if comparing the ResultSet between DynamoDB and Alternator returned from Table Scans.

Refs: https://github.com/scylladb/scylladb/issues/11222
Fixes: https://github.com/scylladb/scylladb/issues/11315

Closes #11360
2022-08-28 10:29:07 +03:00
Takuya ASADA
8835a34ab6 scylla_raid_setup: prevent mount failed for /var/lib/scylla
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Fixes #11359
2022-08-27 03:27:44 +09:00
Takuya ASADA
40134efee4 scylla_raid_setup: check uuid and device path are valid
Added code to check make sure uuid and uuid based device path are valid.
2022-08-27 03:08:31 +09:00
Tomasz Grabiec
661db2706f db: schema_tables: Fix formatting 2022-08-26 17:37:48 +02:00
Tomasz Grabiec
a020c4644c db: schema_mutations: Make operator<<() print all mutations 2022-08-26 16:48:15 +02:00
Tomasz Grabiec
cf034c1891 schema_mutations: Make it a monoid by defining appropriate += operator 2022-08-26 16:48:15 +02:00
Kamil Braun
6c16ae4868 Merge 'raft, limit for command size' from Gusev Petr
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.

Closes #11318

* github.com:scylladb/scylladb:
  raft, use max_command_size to satisfy commitlog limit
  raft, limit for command size
2022-08-26 12:20:58 +02:00
Pavel Emelyanov
6405aba748 toplogy: Use the provided dc/rack info
Previous patches made all the callers of topology.update_endpoint()
(via token_metadata.update_topology()) provide correct dc/rack info
for the endpoint. It's now possible to stop using global snitch by
topology and just rely on the dc/rack argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 10:02:00 +03:00
Pavel Emelyanov
10e8804417 test: Provide testing dc/rack infos
There's a test that's sensitive to correct dc/rack info for testing
entries. To populate them it uses global rack-inferring snitch instance
or a special "testing" snitch. To make it continue working add a helper
that would populate the topology properly (spoiler: next branch will
replace it with explicitly populated topology object).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 10:00:04 +03:00
Pavel Emelyanov
f6abc3f759 storage_service: Provide dc/rack for snitch reconfiguration
When snitch reconfigures (gossiper-property-file one) it kicks storage
service so that it updates itself. This place also needs to update the
dc/rack info about itself, the correct (new) values are taken from the
snitch itself.

There's a bug here -- system.local table it not update with new data
until restart.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:58:34 +03:00
Pavel Emelyanov
f8614fe039 storage_service: Provide dc/rack from system ks on start
When a node starts it loads the information about peers from
system.peers table and populates token metadata and topology with this
information. The dc/rack are taken from the sys-ks cache here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:57:15 +03:00
Pavel Emelyanov
5d5782a086 storage_service: Provide dc/rack from gossiper for replacement
When a node it started to replace another node it updates token metadata
and topology with the target information eary. The tokens are now taken
from gossiper shadow round, this patch makes the same for dc/rack info.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:55:31 +03:00
Pavel Emelyanov
6b70358616 storage_service: Provide dc/rack from gossiper for remotes
When a node is notified about other nodes state change it may want to
update the topology information about it. In all those places the
dc/rack into about the peer is provided by the gossiper.

Basically, these updates mirror the relevant updates of tokens on the
token metadata object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:53:54 +03:00
Pavel Emelyanov
43e83c5415 storage_service,dht,repair: Provide local dc/rack from system ks
When a node starts it adds itself to the topology. Mostly it's done in
the storage_service::join_cluster() and whoever it calls. In all those
places the dc/rack for the added node is taken from the system keyspace
(it's cache was populated with local dc/rack by the previous patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:52:16 +03:00
Pavel Emelyanov
a03d6f7751 system_keyspace: Cache local dc-rack on .start()
There's a cache of endpoint:{dc,rack} on system keyspace cache, but the
local node is not there, because this data is populated from the peers
table, while local node's dc/rack is in snitch (or system.local table).

At the same time, storage_service::join_cluster() and whoever it calls
(e.g. -- the repair) will need this info on start and it's convenient
to have this data on sys-ks cache.

It's not on the peers part of the cache because next branch removes this
map and it's going to be very clumsy to have a whole container with just
one enty in it.

There's a peer code in system_keyspace::setup() that gets the local node
dc/rack and committs it into the system.local table. However, putting
the data into cache is done on .start(). This is because cql-test-env
needs this data cached too, but it doesn't call sys_ks.setup(). Will be
cleaned some other day.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:47:30 +03:00
Pavel Emelyanov
c043f6fa96 topology: Some renames after previous patch
The topology::update_endpoint() is now a plain wrapper over private
::add_endpoint() method of the same class. It's simpler to merge them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:46:26 +03:00
Pavel Emelyanov
4cbe6ee9f4 topology: Require entry in the map for update_normal_tokens()
The method in question tries to be on the safest side and adds the
enpoint for which it updates the tokens into the topology. From now on
it's up to the caller to put the endpoint into topology in advance.

So most of what this patch does is places topology.update_endpoint()
into the relevant places of the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:44:08 +03:00
Pavel Emelyanov
5fc9854eae topology: Make update_endpoint() accept dc-rack info
The method in question populates topology's internal maps with endpoint
vs dc/rack relations. As for today the dc/rack values are taken from the
global snitch object (which, in turn, goes to gossiper, system keyspace
and its internal non-updateable cache for that).

This patch prepares the ground for providing the dc/rack externally via
argument. By now it's just and argument with empty strings, but next
patches will populate it with real values (spoiler: in 99% it's storage
service that calls this method and each call will know where to get it
from for sure)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:41:09 +03:00
Pavel Emelyanov
7305061674 replication_strategy: Accept dc-rack as get_pending_address_ranges argument
The method creates a copy of token metadata and pushes an endpoint (with
some tokens) into it. Next patches will require providing dc/rack info
together with the endpoint, this patch prepares for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:39:44 +03:00
Pavel Emelyanov
360c4f8608 dht: Carry dc-rack over boot_strapper and range_streamer
Both classes may populate (temporarly clones of) token metadata object
with endpoint:tokens pairs for the endpoint they work with. Next patches
will require that endpoint comes with the dc/rack info. This patch makes
sure dht classes have the necessary information at hand (for now it's
just empty pair of strings).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:37:02 +03:00
Pavel Emelyanov
c7a3fed225 storage_service: Make replacement info a real struct
This is to extend it in one of the next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:36:16 +03:00
Botond Dénes
4d33812a77 test/scylla-gdb: test scylla read-stats
This command was not run before, allowing it to silently break.
2022-08-26 08:08:28 +03:00
Botond Dénes
82c157368a scylla-gdb.py: read_stats: update w.r.t. post 4.5 code
scylla_read_stats is not up-to-date wr.r.t. the type of
`reader_concurrency_semaphore::_permit_list`, which was changed in 4.6.
Bring it up-to-date, keeping it backwards compatible with 4.5 and older
releases.
2022-08-26 07:25:40 +03:00
Botond Dénes
107fd97f45 scylla-gdb.py: improve string_view_printer implementation
The `_M_str` member of an `std::string_view` is not guaranteed to be a
valid C string (.e.g. be null terminated). Printing it directly often
resulted in printing partial strings or printing gibberish, effecting in
particular the semaphore diagnostics dumps (scylla read-stats).
Use a more reliable method: read `_M_len` amount of bytes from `_M_str`
and decode as UTF-8.
2022-08-26 07:25:11 +03:00
Avi Kivity
0dbcd13a0f config: change logging::settings constructor call to use designated initializer
Safer wrt reordering, and more readable too.

Closes #11382
2022-08-26 06:14:01 +03:00
Konstantin Osipov
4e128bafb5 docs: clarify the tricky field of row existence in LWT
Closes #11372
2022-08-26 06:10:45 +03:00
Vlad Zolotarov
c538cc2372 scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions
This patch fixes the regression introduced by 3a51e78 which broke
a very important contract: perftune.yaml should not be "touched"
by Scylla scriptology unless explicitly requested.

And a call for scylla_cpuset_setup is such an explicit request.

The issue that the offending patch was intending to fix was that
cpuset.conf was always generated anew for every call of
scylla_cpuset_setup - even if a resulting cpuset.conf would come
out exactly the same as the one present on the disk before tha call.

And since the original code was following the contract mentioned above
it was also deleting perftune.yaml every time too.
However, this was just an unavoidable side-effect of that cpuset.conf
re-generation.

The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf
we should not "touch" perftune.yaml and vise versa.

This patch implements exactly that together with reverting the dangerous
logic introduced by 3a51e78.

Fixes #11385
Fixes #10121
2022-08-25 13:03:02 -04:00
Vlad Zolotarov
80917a1054 scylla_prepare: stop generating 'mode' value in perftune.yaml
Modern perftune.py supports a more generic way of defining IRQ CPUs:
'irq_cpu_mask'.

This patch makes our auto-generation code create a perftune.yaml
that uses this new parameter instead of using outdated 'mode'.

As a side effect, this change eliminates the notion of "incorrect"
value in cpuset.conf - every value is valid now as long as it fits into
the 'all' CPU set of the specific machine.

Auto-generated 'irq_cpu_mask' is going to include all bits from 'all'
CPU mask except those defined in cpuset.conf.

Fixes #9903
2022-08-25 13:02:57 -04:00
Benny Halevy
765d2f5e46 release: define SCYLLA_BUILD_MODE_STR by stringifying SCYLLA_BUILD_MODE
Currently SCYLLA_BULD_MODE is defined as a string by the cxxflags
generated by configure.py.  This is not very useful since one cannot use
it in a @if preprocessor directive.

Instead, use -DSCYLLA_BULD_MODE=release, for example, and define a
SCYLLA_BULD_MODE_STR as the dtirng representation of it.

In addition define the respective
SCYLLA_BUILD_MODE_{RELEASE,DEV,DEBUG,SANITIZE} macros that can be easily
used in @ifdef (or #ifndef :)) for conditional compilation.

The planned use case for it is to enable a task_manager test module only
in non-release modes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11357
2022-08-25 16:50:42 +02:00
Tzach Livyatan
e86cd3684e OS support page: use CentOS not Centos 2022-08-25 17:33:50 +03:00
Wojciech Mitros
49dba4f0c1 functions: fix dropping of a keyspace with an aggregate in it
Currently, if a keyspace has an aggregate and the keyspace
is dropped, the keyspace becomes corrupted and another keyspace
with the same name cannot be created again

This is caused by the fact that when removing an aggregate, we
call create_aggregate() to get values for its name and signature.
In the create_aggregate(), we check whether the row and final
functions for the aggregate exist.
Normally, that's not an issue, because when dropping an existing
aggregate alone, we know that its UDFs also exist. But when dropping
and entire keyspace, we first drop the UDFs, making us unable to drop
the aggregate afterwards.

This patch fixes this behavior by removing the create_aggregate()
from the aggregate dropping implementation and replacing it with
specific calls for getting the aggregate name and signature.

Additionally, a test that would previously fail is added to
cql-pytest/test_uda.py where we drop a keyspace with an aggregate.

Fixes #11327

Closes #11375
2022-08-25 16:28:57 +02:00
Tzach Livyatan
f6157a38a0 OS support page: add 5.1, 2022.1 and 2022.2 2022-08-25 16:44:40 +03:00
Tzach Livyatan
1697f17d90 OS support page: Update CentOS 8 to Rocky 8 2022-08-25 16:43:24 +03:00
Tomasz Grabiec
83850e247a Merge 'raft: server: handle aborts when waiting for config entry to commit' from Kamil Braun
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.

The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.

Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.

Fixes #11288.

Closes #11325

* github.com:scylladb/scylladb:
  test: raft: randomized_nemesis_test: additional logging
  raft: server: handle aborts when waiting for config entry to commit
2022-08-25 12:49:09 +02:00
Avi Kivity
df87949241 Merge "Remove batch tokens update helper" from Pavel E
"
On token_metadata there are two update_normal_tokens() overloads --
one updates tokens for a single endpoint, another one -- for a set
(well -- std::map) of them. Other than updating the tokens both
methods also may add an endpoint to the t.m.'s topology object.

There's an ongoing effort in moving the dc/rack information from
snitch to topology, and one of the changes made in it is -- when
adding an entry to topology, the dc/rack info should be provided
by the caller (which is in 99% of the cases is the storage service).
The batched tokens update is extremely unfriendly to the latter
change. Fortunately, this helper is only used by tests, the core
code always uses fine-grained tokens updating.
"

* 'br-tokens-update-relax' of https://github.com/xemul/scylla:
  token_metadata: Indentation fix after prevuous patch
  token_metadata: Remove excessive empty tokens check
  token_metadata: Remove batch tokens updating method
  tests: Use one-by-one tokens updating method
2022-08-25 12:01:58 +02:00
Wojciech Mitros
9e6e8de38f tests: prevent test_wasm from occasional failing
Some cases in test_wasm.py assumed that all cases
are ran in the same order every time and depended
on values that should have been added to tables in
previous cases. Because of that, they were sometimes
failing. This patch removes this assumption by
adding the missing inserts to the affected cases.

Additionally, an assert that confirms low miss
rate of udfs is more precise, a comment is added
to explain it clearly.

Closes #11367
2022-08-25 11:32:06 +03:00
Kamil Braun
90233551be test: raft: randomized_nemesis_test: don't access failure detector service after it's stopped
It could happen that we accessed failure detector service after it was
stopped if a reconfiguration happened in the 'right' moment. This would
resolve in an assertion failure. Fix this.

Closes #11326
2022-08-25 11:32:06 +03:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00
Pavel Emelyanov
d8c5044eee token_metadata: Indentation fix after prevuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
8238c38e9f token_metadata: Remove excessive empty tokens check
After the previous patch empty passed tokens make the helper co_return
early, so this if is the dead code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
056d21c050 token_metadata: Remove batch tokens updating method
No users left.

The endpoint_tokens.empty() check is removed, only tests could trigger
it, but they didn't and are patched out.

Indentation is left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
1d437302a8 tests: Use one-by-one tokens updating method
Tests are the only users of batch tokens updating "sugar" which
actually makes things more complicated

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
18fa5038b1 replication_strategy: Remove unused method
The get_pending_address_ranges() accepting a single token is not in use,
its peer that accepts a set of tokens is

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11358
2022-08-23 20:23:50 +02:00
Avi Kivity
6ce5e9079c Merge 'utils/logalloc: consolidate lsa state in shard tracker' from Botond Dénes
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the less globals, the better) this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.

Closes #11284

* github.com:scylladb/scylladb:
  utils/logalloc: remove reclaim_timer:: globals
  utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
  utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
  utils/logalloc: move global stat accessors to tracker
  utils/logalloc: allocating_section: don't use the global tracker
  utils/logalloc: pass down tracker::impl reference to segment_pool
  utils/logalloc: move segment pool into tracker
  utils/logalloc: add tracker member to basic_region_impl
  utils/logalloc: make segment independent of segment pool
2022-08-23 18:51:14 +02:00
Benny Halevy
a980510654 table: seal_active_memtable: handle ENOSPC error
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.

Fixes #11245

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11246
2022-08-23 17:58:20 +02:00
Tomasz Grabiec
9c4e32d2e2 Merge 'raft: server: drop waiters in applier_fiber instead of io_fiber' from Kamil Braun
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).

If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.

The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.

Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.

Improve an existing test to reproduce this scenario more frequently.

Fixes #11235.

Closes #11308

* github.com:scylladb/scylladb:
  test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
  raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
  raft: server: use `visit` instead of `holds_alternative`+`get`
2022-08-23 17:19:44 +02:00
Avi Kivity
fd9d8ddb3e Merge 'distributed_loader: Restore separate processing of keyspace init prio/normal' from Calle Wilund
Fixes #11349

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This patch set does so, adapted
to the recent version of this file.

Closes #11350

* github.com:scylladb/scylladb:
  distributed_loader: Restore separate processing of keyspace init prio/normal
  Revert "distributed_loader: Remove unused load-prio manipulations"
2022-08-23 16:25:48 +02:00
Kamil Braun
e350e37605 service/raft: raft_group0: implement upgrade procedure
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).

The listener starts the `upgrade_to_group0` procedure.

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
  `system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
  disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
  member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
  for schema operations (only those for now).

The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.

`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.

If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.

We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.

To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
  in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
  (the keys are `raft_group0_id` and `raft_server_id`
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.

Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consist of schema tables, and the upgrade
procedure ensures to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
 whenever we extend group 0 with new stuff...)
2022-08-23 13:51:01 +02:00
Kamil Braun
b42dfbc0aa test: raft: randomized_nemesis_test: additional logging
Add some more logging to `randomized_nemesis_test` such as logging the
start and end of a reconfiguration operation in a way that makes it easy
to find one given the other in the logs.
2022-08-23 13:14:30 +02:00
Kamil Braun
efad6fe9b4 raft: server: handle aborts when waiting for config entry to commit
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.

The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.

Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.

Fixes #11288.
2022-08-23 13:14:29 +02:00
Calle Wilund
54aca8e814 distributed_loader: Restore separate processing of keyspace init prio/normal
Fixes #11349

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This patch and revert before it does so, adapted
to the recent version of this file.
2022-08-23 10:39:19 +00:00
Calle Wilund
d9c391e366 Revert "distributed_loader: Remove unused load-prio manipulations"
This reverts commit 7396de72b1.

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This reverts the actual commit, patch after
ensures we use the prio set.
2022-08-23 10:34:05 +00:00
Avi Kivity
5d1ff17ddf Merge 'Streaming: define plan_id as a strong tagged_uuid type' from Benny Halevy
This series turns plan_id from a generic UUID into a strong type so it can't be used interchangeably with other uuid's.
While at it, streaming/stream_fwd.hh was added for forward declarations and the definition of plan_id.
Also, `stream_manager::update_progress` parameter name was renamed to plan_id to represent its assumed content, before changing its type to `streaming::plan_id`.

Closes #11338

* github.com:scylladb/scylladb:
  streaming: define plan_id as a strong tagged_uuid type
  stream_manager: update_progress: rename cf_id param to plan_id
  streaming: add forward declarations in stream_fwd.hh
2022-08-23 10:48:34 +02:00
Petr Gusev
aa88d58539 raft, use max_command_size to satisfy commitlog limit
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
2022-08-23 12:09:32 +04:00
Tomasz Grabiec
0e5b86d3da Merge 'Optimize mutation consume of range tombstones in reverse' from Benny Halevy
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.

Instead, iterate over the range_tombsotne_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.

While at it, this series contains some other cleanups on this path
to improve the code readability and maybe make the compiler's life
easier as for optimizing the cleaned-up code.

Closes #11271

* github.com:scylladb/scylladb:
  mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
  mutation: consume_clustering_fragments: reindent
  mutation: consume_clustering_fragments: shuffle emit_rt logic around
  mutation: consume, consume_gently: simplify partition_start logic
  mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
  mutation: consume_clustering_fragments: keep the reversed schema in cookie
  mutation: clustering_iterators: get rid of current_rt
  mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
2022-08-23 10:05:39 +02:00
Botond Dénes
5bc499080d utils/logalloc: remove reclaim_timer:: globals
One of them (_active_timer) is moved to shard tracker, the other is made
a simple local in reclaim_timer.
2022-08-23 10:38:58 +03:00
Botond Dénes
5f8971173e utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
2022-08-23 10:38:58 +03:00
Botond Dénes
499b9a3a7c utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg 2022-08-23 10:38:58 +03:00
Botond Dénes
7d17d675af utils/logalloc: move global stat accessors to tracker
These are pretend free functions, accessing globals in the background,
make them a member of the tracker instead, which everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
2022-08-23 10:38:58 +03:00
Botond Dénes
f406151a86 utils/logalloc: allocating_section: don't use the global tracker
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
2022-08-23 10:38:58 +03:00
Botond Dénes
e968866fa1 utils/logalloc: pass down tracker::impl reference to segment_pool
To get rid of some usages of `shard_tracker()`.
2022-08-23 10:38:58 +03:00
Botond Dénes
3bd94e41bf utils/logalloc: move segment pool into tracker
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
2022-08-23 10:38:58 +03:00
Botond Dénes
5b86dfc35a utils/logalloc: add tracker member to basic_region_impl
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
2022-08-23 10:38:58 +03:00
Botond Dénes
f4056bd344 utils/logalloc: make segment independent of segment pool
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
2022-08-23 10:38:58 +03:00
Nadav Har'El
9c15659194 Merge 'test.py: bump timeout of async requests for topology' from Alecco
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.

Pass 60 seconds timeout (default is 10) to match the session's.

Fixes https://github.com/scylladb/scylladb/issues/11289

Closes #11348

* github.com:scylladb/scylladb:
  test.py: bump schema agreement timeout for topology tests
  test.py: bump timeout of async requests for topology
  test.py: fix bad indent
2022-08-23 10:30:59 +03:00
Raya Kurlyand
bc7539cff0 Update auditing.rst
https://github.com/scylladb/scylladb/issues/11341

Closes #11347
2022-08-23 06:59:41 +03:00
Botond Dénes
331033adae Merge 'Fix frozen mutation consume ordering' from Benny Halevy
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Closes #11269

* github.com:scylladb/scylladb:
  mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
  frozen_mutation: consume and consume_gently in-order
  frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
  frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
  frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
  frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
2022-08-23 06:37:04 +03:00
Alejo Sanchez
01cac33472 test.py: bump schema agreement timeout for topology tests
Increase the schema agreement timeout to match other timeouts.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 21:07:55 +02:00
Alejo Sanchez
f9d31112cf test.py: bump timeout of async requests for topology
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.

Pass 60 seconds timeout (default is 10) to match the session's.

This will hopefully will fix timeout failures on debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 21:07:03 +02:00
Benny Halevy
357e805e1f mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
So that the frozen_mutation consumer can return
stop_iteration::yes if it wishes to stop consuming at
some clustering position.

In this case, on_end_of_partition must still be called
so a closing range_tombstone_change can be emitted to the consumer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:58 +03:00
Mikołaj Sielużycki
b5380baf8a frozen_mutation: consume and consume_gently in-order
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:20 +03:00
Kamil Braun
e0c6153adf test: raft: randomized_nemesis_test: more chaos in remove_leader_with_forwarding_finishes
Improve the randomness of this test, making it a bit easier to
reproduce the scenarios that the test aims to catch.

Increase timeouts a bit to account for this additional randomness.
2022-08-22 18:53:48 +02:00
Kamil Braun
db2a3deda1 raft: server: drop waiters in applier_fiber instead of io_fiber
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).

If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.

The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.

Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.

Fixes #11235.
2022-08-22 18:53:44 +02:00
Kamil Braun
5badf20c7a raft: server: use visit instead of holds_alternative+get
In `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
2022-08-22 18:47:48 +02:00
Benny Halevy
314e45d957 streaming: define plan_id as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:45:30 +03:00
Benny Halevy
add612bc52 mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.

Instead, iterator over the range_tombsotne_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:42:52 +03:00
Alejo Sanchez
87c233b36b test.py: fix bad indent
Fix leftover bad indent

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 14:29:54 +02:00
Nadav Har'El
941c719a23 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explictly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also in empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298
2022-08-22 09:58:09 +02:00
Takuya ASADA
60e8f5743c systemd: drop StandardOutput=syslog
On recent version of systemd, StandardOutput=syslog is obsolete.
We should use StandardOutput=journal instead, but since it's default value,
so we can just drop it.

Fixes #11322

Closes #11339
2022-08-22 10:47:37 +03:00
Benny Halevy
fa7033bc2b configure: add --perf-tests-debuginfo option
Provides separate control over debuginfo for perf tests
since enabling --tests-debuginfo affects both today
causing the Jenkins archives of perf tests binaries to
inflate considerably.

Refs https://github.com/scylladb/scylla-pkg/issues/3060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11337
2022-08-21 19:08:21 +03:00
Konstantin Osipov
4b6ed7796b test.py: extend documentation
Add documentation about python, topology tests,
server pooling and provide some debugging tips.

Closes #11317
2022-08-21 17:55:49 +03:00
Benny Halevy
3554533e2c stream_manager: update_progress: rename cf_id param to plan_id
Before changing its type to streaming::plan_id
this patch clarifies that the parameter actually represents
the plan id and not the table id as its name suggests.

For reference, see the call to update_progress in
`stream_transfer_task::execute`, as well as the function
using _stream_bytes which map key is the plan id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-21 16:56:41 +03:00
Benny Halevy
c1fc0672a5 streaming: add forward declarations in stream_fwd.hh
To be used for defining streaming::plan_id
in the next patcvh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-21 16:00:02 +03:00
Anna Stuchlik
3b93184680 doc: fix the description of ORDER BY for the SELECT statement
Closes #11272
2022-08-21 15:28:15 +03:00
Tzach Livyatan
8fc58300ea Update Alternator Markdown file to use automatic link notation
Closes #11335
2022-08-21 13:32:57 +03:00
Piotr Sarna
484004e766 Merge 'Fix mutation commutativity with shadowable tombstone'
from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric
2022-08-20 16:46:32 +02:00
Kamil Braun
2ba1fb0490 service/raft: raft_group0: extract tracker from persistent_discovery::run
Extract it to a top-level abstraction, write comments.
It will be reused in the following commit.
2022-08-19 19:15:19 +02:00
Kamil Braun
f7e02a7de9 service/raft: raft_group0: introduce local loggers for group 0 and upgrade 2022-08-19 19:15:19 +02:00
Kamil Braun
ac5f4248a9 service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
During the upgrade procedure nodes will want to obtain the upgrade state
of other nodes to proceed. This is what the new verb is for.
2022-08-19 19:15:19 +02:00
Kamil Braun
43687be1f1 service/raft: raft_group0_client: prepare for upgrade procedure
Now, whether an 'group 0 operation' (today it means schema change) is
performed using the old or new methods, doesn't depend on the local RAFT
fature being enabled, but on the state of the upgrade procedure.

In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.

The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.

To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.

We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).

Some additional comments were written.
2022-08-19 19:15:19 +02:00
Kamil Braun
7e56251aea service/raft: introduce group0_upgrade_state
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).

Provide IDL definitions for (de)serialization.

The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).

Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
2022-08-19 19:15:19 +02:00
Kamil Braun
547134faf4 db: system_keyspace: introduce load_peers
Load the addresses of our peers from `system.peers`.

Will be used be the Raft upgrade procedure to obtain the set of all
peers.
2022-08-19 19:15:18 +02:00
Kamil Braun
a5b465b796 idl-compiler: introduce cancellable verbs
The compiler allowed passing a `with_timeout` flag to a verb definition;
it then generated functions for sending and handling RPCs that accepted
a timeout parameter.

We would like to generate functions that accept an `abort_source` so an
RPC can be cancelled from the sender side. This is both more and less
powerful than `with_timeout`. More powerful because you can abort on
other conditions than just reaching a certain point in time. Less
powerful because you can't abort the receiver. In any case, sometimes
useful.

For this the `cancellable` flag was added.
You can't use `with_timeout` and `cancellable` at the same verb.

Note that this uses an already existing function in RPC module,
`send_message_cancellable`.
2022-08-19 19:15:18 +02:00
Kamil Braun
9e5a81da4a message: messaging_service: cancellable version of send_schema_check
This RPC will be used during the Raft upgrade procedure during schema
synchronization step.

Make a version which can be cancelled when the upgrade procedure gets
aborted.
2022-08-19 19:15:18 +02:00
Nadav Har'El
516089beb0 Merge 'Raft test topology II part 1' from Alecco
- Remove `ScyllaCluster.__getitem__()`  (pending request by @kbr- in a previous pull request), for this remove all direct access to servers from caller code
- Increase Python driver timeouts (req by @nyh)
- Improve `ManagerClient` API requests: use `http+unix://<sockname>/<resource>` instead of `http://localhost/<resource>` and callers of the helper method only pass the resource
- Improve lint and type hints

Closes #11305

* github.com:scylladb/scylladb:
  test.py: remove ScyllaCluster.__getitem__()
  test.py: ScyllaCluster check kesypace with any server
  test.py: ScyllaCluster server error log method
  test.py: ScyllaCluster read_server_log()
  test.py: save log point for all running servers
  test.py: ScyllaCluster provide endpoint
  test.py: build host param after before_test
  test.py: manager client disable lint warnings
  test.py: scylla cluster lint and type hint fixes
  test.py: increase more timeouts
  test.py: ManagerClient improve API HTTP requests
2022-08-18 20:27:50 +03:00
Alejo Sanchez
fe07f9ceed test.py: make topology conftest module paths work when imported
To allow other suites to use topology suite conftest, add pylib to the
module lookup path.

Closes #11313
2022-08-18 20:22:35 +03:00
Konstantin Osipov
7481f0d404 test.py: simplify CQL test search
No need to repeat code available in the base class.

Closes #11156
2022-08-18 19:28:43 +03:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Avi Kivity
35fbba3a5b Revert "gms: gossiper: include nodes with empty feature sets when calculating enabled features"
This reverts commit 08842444b4. It causes
a failure in test_shutdown_all_and_replace_node.

Fixes #11316.
2022-08-18 15:01:50 +03:00
Kamil Braun
b52429f724 Merge 'raft: relax some error severity' from Gleb Natapov
Dtest fails if it sees an unknown errors in the logs. This series
reduces severity of some errors (since they are actually expected during
shutdown) and removes some others that duplicate already existing errors
that dtest knows how to deal with. Also fix one case of unhandled
exception in schema management code.

* 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla:
  raft: getting abort_requested_exception exception from a sm::apply is not a critical error
  schema_registry: fix abandoned feature warning
  service: raft: silence rpc::closed_errors in raft_rpc
2022-08-18 12:16:44 +02:00
Anna Stuchlik
dc307b6895 doc: fix the CQL version in the Interfaces table 2022-08-18 12:02:42 +02:00
Petr Gusev
eedfd7ad9b raft, limit for command size
Adds max_command_size to the raft configuration and
restricts commands to this limit.
2022-08-18 13:35:49 +04:00
Tomasz Grabiec
5a9df433c6 test: mutation_test: Add explicit test for mutation commutativity 2022-08-17 17:39:54 +02:00
Tomasz Grabiec
3d9efee3bf test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
Given 3 row mutations:

m1 = {
      marker: {row_marker: dead timestamp=-9223372036854775803},
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

m2 = {
      marker: {row_marker: timestamp=-9223372036854775805}
}

m3 = {
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}
}

We get different shadowable tombstones depending on the order of merging:

(m1 + m2) + m3 = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}

m1 + (m2 + m3) = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

The reason is that in the second case the shadowable tombstone in m3
is shadwed by the row marker in m2. In the first case, the marker in
m2 is cancelled by the dead marker in m1, so shadowable tombstone in
m3 is not cancelled (the marker in m1 does not cancel because it's
dead).

This wouldn't happen if the dead marker in m1 was accompanied by a
hard tombstone of the same timestamp, which would effectively make the
difference in shadowable tombstones irrelevant.

Found by row_cache_test.cc::test_concurrent_reads_and_eviction.

I'm not sure if this situation can be reached in practice (dead marker
in mv table but no row tombstone).

Work it around for tests by producing a row tombstone if there is a
dead marker.

Refs #11307
2022-08-17 17:39:54 +02:00
Tomasz Grabiec
56e5b6f095 db: mutation_partition: Drop unnecessary maybe_shadow()
It is performed inside row_tombstone::apply() invoked in the preceding line.
2022-08-17 17:39:54 +02:00
Tomasz Grabiec
9c66c9b3f0 db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
When the row has a live row marker which shadows the shadowable
tombstone, the shadowable tombstone should not be effective. The code
assumes that _shadowable always reflects the current tombstone, so
maybe_shadow() needs to be called whenever marker or regular tombstone
changes. This was not ensured by row::apply(tombstone).

This causes problems in tests which use random_mutation_generator,
which generates mutations which would violate this invariant, and as a
result, mutation commutativity would be violated.

I am not aware of problems in production code.
2022-08-17 17:34:13 +02:00
Botond Dénes
778f5adde7 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakyness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allow overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2022-08-17 17:22:13 +02:00
Benny Halevy
8f0376bba1 mutation: consume_clustering_fragments: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 16:45:20 +03:00
Benny Halevy
749371c2b0 mutation: consume_clustering_fragments: shuffle emit_rt logic around
To prepare for a following patch that will get rid of
the cookie.reversed_range_tombstones list.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 16:44:23 +03:00
Benny Halevy
0e21073c38 mutation: consume, consume_gently: simplify partition_start logic
Concentrate the logic in a single
(!cookie.partition_start_consumed) block

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:49:12 +03:00
Benny Halevy
d661b84d51 mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
and set crs and rts only in the block where they are used,
so we can get rid of reversed_range_tombstones.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:36 +03:00
Benny Halevy
f1b7a1a6f1 mutation: consume_clustering_fragments: keep the reversed schema in cookie
Rather than reversing the schema on every call
just keep the potentially reversed schema in cookie.

Othwerwise, cookie.schema was write only.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:36 +03:00
Benny Halevy
a230ea0019 mutation: clustering_iterators: get rid of current_rt
It is currently write-only.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:16 +03:00
Benny Halevy
017f9b4131 mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 14:43:52 +03:00
Alejo Sanchez
d732d776ed test.py: remove ScyllaCluster.__getitem__()
Users of ScyllaCluster should not directly manage its ScyllaServers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
729f8e2834 test.py: ScyllaCluster check kesypace with any server
Directly pick any server instead of calling self[0].

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
7ad7a5e718 test.py: ScyllaCluster server error log method
Provide server error logs to caller (test.py).

Avoids direct access to list of servers.

To be done later: pick the failed server. For now it just provides the
log of one server.

While there, fix type hints.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
e755207fcc test.py: ScyllaCluster read_server_log()
Instead of accessing the first server, now test.py asks ScyllaCluster
for the server log.

In a later commit, ScyllaCluster will pick the appropriate server.

Also removes another direct access to the list of servers we want to get
rid of.
2022-08-17 10:24:48 +02:00
Alejo Sanchez
f141ab95f9 test.py: save log point for all running servers
For error reporting, before a test a mark of the log point in time is
saved. Previously, only the log of the first server was saved. Now it's
done for all running servers.

While there, remove direct access to servers on test.py.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
8fff636776 test.py: ScyllaCluster provide endpoint
For pytest CQL driver connections a host id (IP) is used. Provide it
with a method.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
5bd266424e test.py: build host param after before_test
If no server started, there is no server in the cluster list. So only
build the pytest --host param after before_test check is done.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
30c8e961ba test.py: manager client disable lint warnings
Disable noisy lint warnings.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
2b4c7fbb8a test.py: scylla cluster lint and type hint fixes
Add missing docstrings, reorder imports, add type hints, improve
formatting, fix variable names, fix line lengths, iterate over dicts not
keys, and disable noisy lint warnings.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
566a4ebf4e test.py: increase more timeouts
Increase Python driver connection timeouts to deal with extreme cases
for slow debug builds in slow machines as done (and explained) in
95bd02246a.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
ce27c02d91 test.py: ManagerClient improve API HTTP requests
Use the AF Unix socket name as host name instead of localhost and avoid
repeating the full URL for callers of _request() for the Manager API
requests from the client.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Benny Halevy
1b997a8514 frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
It is a range_tombstone_change, not a range_tombstone.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
87fd4a7d82 frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
If the consumer return stop_iteration::yes for a flushed
row (static or clustered, we should return early and
no consume any more fragments, until `on_end_of_partition`,
where we may still consume a closing range_tombstone_change
past the last consumed row.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
f11a5e2ec8 frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
Consuming the static row is the first ooportunity for
the consumer to return stop_iteration::yes, so there's no
point in checking `_stop_consuming` before consuming it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
4b4eb9037a frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
We already flushed rt_gen when building the current_row
When we get to flush_rows_and_tombstones, we should
just consume it, as the passed position is not if the
current_row but rather a position following it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Nadav Har'El
055340ae39 cql-pytest: increase more timeouts
In commit 7eda6b1e90, we increased the
request_timeout parameter used by cql-pytest tests from the default of
10 seconds to 120 seconds. 10 seconds was usually more than enough for
finishing any Scylla request, but it turned out that in some extreme
cases of a debug build running on an extremely over-committed machine,
the default timeout was not enough.

Recently, in issue #11289 we saw additional cases of timeouts which
the request_timeout setting did *not* solve. It turns out that the Python
CQL driver has two additional timeout settings - connect_timeout and
control_connection_timeout, which default to 5 seconds and 2 seconds
respectively. I believe that most of the timeouts in issue #11289
come from the control_connection_timeout setting - by changing it
to a tiny number (e.g., 0.0001) I got the same error messages as those
reported in #11289. The default of that timeout - 2 seconds - is
certainly low enough to be reached on an extremely over-committed
machine.

So this patch significantly increases both connect_timeout and
control_connection_timeout to 60 seconds. We don't care that this timeout
is ridiculously large - under normal operations it will never be reached.
There is no code which loops for this amount of time, for example.

Refs #11289 (perhaps even Fixes, we'll need to see that the test errors
go away).

NOTE: This patch only changes test/cql-pytest/util.py, which is only
used by the cql-pytest test suite. We have multiple other test suites which
copied this code, and those test suites might need fixing separately.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11295
2022-08-16 19:11:59 +03:00
Kamil Braun
08842444b4 gms: gossiper: include nodes with empty feature sets when calculating enabled features
Right now, if there's a node for which we don't know the features
supported by this node (they are neither persisted locally, nor gossiped
by that node), we would skip this node in calculating the set
of enabled features and potentially enable a feature which shouldn't be
enabled - because that node may not know it. We should only enable a
feature when we know that all nodes have upgraded and know the feature.

This bug caused us problems when we tried to move RAFT out of
experimental. There are dtests such as `partitioner_tests.py` in which
nodes would enable features prematurely, which caused the Raft upgrade
procedure to break (the procedure starts only when all nodes upgrade
and announce that they know the SUPPORTS_RAFT cluster feature).

Closes #11225
2022-08-16 19:07:41 +03:00
Piotr Sarna
cf30d4cbcf Merge 'Secondary index of collection columns' from Nadav Har'El
This pull request introduces global secondary-indexing for non-frozen collections.

The intent is to enable such queries:

```
CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));

-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```

We use here all-familiar materialized views (MVs). Scylla treats all the
collections the same way - they're a list of pairs (key, value). In case
of sets, the value type is dummy one. In case of lists, the key type is
TIMEUUID. When describing the design, I will forget that there is more
than one collection type.  Suppose that the columns in the base table
were as follows:

```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```

The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.

```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```

The reason for the last additional column is that the values from a collection might not be unique.

Fixes #2962
Fixes #8745
Fixes #10707

This patch does not implement **local** secondary indexes for collection columns: Refs #10713.

Closes #10841

* github.com:scylladb/scylladb:
  test/cql-pytest: un-xfail yet another passing collection-indexing test
  secondary index: fix paging in map value indexing
  test/cql-pytest: test for paging with collection values index
  cql, view: rename and explain bytes_with_action
  cql, index: make collection indexing a cluster feature
  test/cql-pytest: failing tests for oversized key values in MV and SI
  cql: fix secondary index "target" when column name has special characters
  cql, index: improve error messages
  cql, index: fix default index name for collection index
  test/cql-pytest: un-xfail several collecting indexing tests
  test/cql-pytest/test_secondary_index: verify that local index on collection fails.
  docs/design-notes/secondary_index: add `VALUES` to index target list
  test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
  cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
  cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
  test/boost/secondary_index_test: update for non-frozen indexes on collections
  test/cql-pytest: Uncomment collection indexes tests that should be working now
  cql, index: don't use IS NOT NULL on collection column
  cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
  cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
  cql3, index: Use entries() indexes on collections for queries
  cql3, index: Use keys() and values() indexes on collections for queries.
  types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
  cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
  db/view/view.cc: compute view_updates for views over collections
  view info: has_computed_column_depending_on_base_non_primary_key
  column_computation: depends_on_non_primary_key_column
  schema, index/secondary_index_manager: make schema for index-induced mv
  index/secondary_index_manager: extract keys, values, entries types from collection
  cql3/statements/: validate CREATE INDEX for index over a collection
  cql3/statements/create_index_statement,index_target: rewrite index target for collection
  column_computation.hh, schema.cc: collection_column_computation
  column_computation.hh, schema.cc: compute_value interface refactor
  Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
2022-08-16 14:18:51 +02:00
Nadav Har'El
fbb0b66d0c test/cql-pytest: fix run's "--ssl" option
Commit 23acc2e848 broke the "--ssl" option of test/cql-pytest/run
(which makes Scylla - and cqlpytest - use SSL-encrypted CQL).
The problem was that there was a confusion between the "ssl" module
(Python's SSL support) and a new "ssl" variable. A rename and a missing
"import" solves the breakage.

We never noticed this because Jenkins does *not* run cql-pytest/run
with --ssl (actually, it no longer runs cql-pytest/run at all).
It is still a useful option for checking SSL-related problems in Scylla
and Seastar.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11292
2022-08-16 12:29:05 +02:00
Kamil Braun
4e35e62597 Merge 'Raft test topology part 3' from Alecco
Test schema changes when there was an underlying topology change.

- per test case checks of cluster health and cycling
- helper class to do cluster manager API requests
- tests can perform topology changes: stop/start/restart servers
- modified clusters are marked dirty and discarded after the test case
- cql connection is updated per topology change and per cluster change

Closes #11266

* github.com:scylladb/scylladb:
  test.py: test topology and schema changes
  test.py: ClusterManager API mark cluster dirty
  test.py: call before/after_test for each test case
  test.py: handle driver connection in ManagerClient
  test.py: ClusterManager API and ManagerClient
  test.py: improve topology docstring
2022-08-16 11:00:26 +02:00
Avi Kivity
afa7960926 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()
2022-08-15 19:05:59 +03:00
Botond Dénes
d56dcb842c db/virtual_table: add virtual destructor to virtual_table
It should have had one, derived instances are stored and destroyed via
the base-class. The only reason this haven't caused bugs yet is that
derived instances happen to not have any non-trivial members yet.

Closes #11293
2022-08-15 16:58:05 +03:00
Avi Kivity
73d4930815 Merge 'test/lib: various improvements to sstable test env' from Botond Dénes
A mixed bag of improvements developed as part of another PR (https://github.com/scylladb/scylladb/pull/10736). Said PR was closed so I'm submitting these improvements separately.

Closes #11294

* github.com:scylladb/scylladb:
  test/lib: move convenience table config factory to sstable_test_env
  test/lib/sstable_test_env: move members to impl struct
  test/lib/sstable_utils: use test_env::do_with_async()
2022-08-15 16:57:01 +03:00
Botond Dénes
92e5f438a4 querier: querier_cache: remove now unused evict_all_for_table() 2022-08-15 14:16:41 +03:00
Botond Dénes
2b1eb6e284 database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
Instead of querier_cache::evict_all_for_table(). The new method cover
all queriers and in addition any other inactive reads registered on the
semaphore. In theory by the time we detach a table, no regular inactive
reads should be in the semaphore anymore, but if there is any still, we
better evict them before the table is destroyed, they might attempt to
access it in when destroyed later.
2022-08-15 14:16:41 +03:00
Botond Dénes
e55ccbde8f reader_concurrency_semaphore: add evict_inactive_reads_for_table()
Allowing for evicting all inactive reads that belong to a certain table.
2022-08-15 14:16:41 +03:00
Botond Dénes
c8ef356859 test/lib: move convenience table config factory to sstable_test_env
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
2022-08-15 11:23:59 +03:00
Botond Dénes
c0e017e0f7 test/lib/sstable_test_env: move members to impl struct
All present members of sstable_test_env are std::unique_ptr<>:s because
they require stable addresses. This makes their handling somewhat
awkward. Move all of them into an internal `struct impl` and make that
member a unique ptr.
2022-08-15 11:20:09 +03:00
Botond Dénes
a9f296ed47 test/lib/sstable_utils: use test_env::do_with_async()
Instead of manually instantiating test_env.
2022-08-15 11:19:27 +03:00
Botond Dénes
a9573b84c5 Merge 'commitlog: Revert/modify fac2bc4 - do footprint add in delete' from Calle Wilund
Fixes #11184
Fixes #11237

In prev (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.

This effectively prevents us from creating segments iff we have tight limits. Since we nowadays do quite a bit of inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where we deadlock start because we cannot get to the actual replay that will eventually free things.

Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.

Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.

To further emphasize this, don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially ask for the list twice.
More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.

Closes #11251

* github.com:scylladb/scylladb:
  commitlog: Make get_segments_to_replay on-demand
  commitlog: Revert/modify fac2bc4 - do footprint add in delete
2022-08-15 09:10:32 +03:00
Botond Dénes
8f10413087 Merge 'doc: describe specifying workload attributes with service levels' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11197

This PR adds a new page where specifying workload attributes with service levels is described and adds it to the menu.

Also, I had to fix some links because of the warnings.

Closes #11209

* github.com:scylladb/scylladb:
  doc: remove the reduntant space from index
  doc: update the syntax for defining service level attributes
  doc: rewording
  doc: update the links to fix the warnings
  doc: add the new page to the toctree
  doc: add the descrption of specifying workload attributes with service levels
  doc: add the definition of workloads to the glossary
2022-08-15 07:14:28 +03:00
Nadav Har'El
c8b5c3595e Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity
Increase readability in preparation for managing topology with
effective_replication_map (continuing 69aea59d9).

Closes #11290

* github.com:scylladb/scylladb:
  cql3: select_statement: improve loop termination condition in  indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
2022-08-14 23:26:06 +03:00
Nadav Har'El
4a4231ea53 Merge 'storage_proxy: coroutinize some counter mutate functions' from Avi Kivity
In preparation for effective_replication_map hygiene, convert
some counter functions to coroutines to simplify the changes.

Closes #11291

* github.com:scylladb/scylladb:
  storage_proxy: mutate_counters_on_leader: coroutinize
  storage_proxy: mutate_counters: coroutinize
  storage_proxy: mutate_counters: reorganize error handling
2022-08-14 23:16:42 +03:00
Avi Kivity
8070cdbbf9 storage_proxy: mutate_counters_on_leader: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.
2022-08-14 17:36:58 +03:00
Avi Kivity
6e330d98d2 storage_proxy: mutate_counters: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.

This is probably a pessimization of the error case, but the error case
will be terrible in any case unless we resultify it.
2022-08-14 17:28:46 +03:00
Avi Kivity
105b066ff7 storage_proxy: mutate_counters: reorganize error handling
Move the error handling function where it's used so the code
is more straightforward.

Due to some std::move()s later, we must still capture the schema early.
2022-08-14 17:13:22 +03:00
Avi Kivity
fbaa280acd cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query()
Move the termination condition to the front of the loop so it's
clear why we're looping and when we stop.

It's less than perfectly clean since we widen the scope of some variables
(from loop-internal to loop-carried), but IMO it's clearer.
2022-08-14 15:40:45 +03:00
Avi Kivity
60c7c11c96 cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
Reindent after coroutinization. No functional changes.
2022-08-14 15:35:36 +03:00
Avi Kivity
492dc6879e cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
It's much easier to maintain this way. Since it uses ranges_to_vnodes,
it interacts with topology and needs integration into
effective_replication_map management.

The patch leaves bad indentation and an infinite-looking loop in
the interest of minimization, but that will be corrected later.

Note, the test for `!r.has_value()` was eliminated since it was
short-circuited by the test for `!rqr.has_value()` returning from
the coroutine rather than propagating an error.
2022-08-14 15:31:45 +03:00
Avi Kivity
973034978c cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
We use result_wrap() in two places, but that makes coroutinizing the
containing function a little harder, since it's composed of more lambdas.

Remove the wrappers, gaining a bit of performance in the error case.
2022-08-14 15:22:18 +03:00
Kamil Braun
b4c5b79f5e db: system_distributed_keyspace: don't call on_internal_error in check_exists
The function `check_exists` checks whether a given table exists, giving
an error otherwise. It previously used `on_internal_error`.

`check_exists` is used in some old functions that insert CDC metadata to
CDC tables. These tables are no longer used in newer Scylla versions
(they were replaced with other tables with different schema), and this
function is no longer called. The table definitions were removed and
these tables are no longer created. They will only exists in clusters
that were upgraded from old versions of Scylla (4.3) through a sequence
of upgrades.

If you tried to upgrade from a very old version of Scylla which had
neither the old or the new tables to a modern version, say from 4.2 to
5.0, you would get `on_internal_error` from this `check_exists`
function. Fortunately:
1. we don't support such upgrade paths
2. `on_internal_error` in production clusters does not crash the system,
   only throws. The exception would be catched, printed, and the system
   would run (just without CDC - until you finished upgrade and called
   the propoer nodetool command to fix the CDC module).

Unfortunately, there is a dtest (`partitioner_tests.py`) which performs
an unsupported upgrade scenario - it starts Scylla from Cassandra (!)
work directories, which is like upgrading from a very old version of
Scylla.

This dtest was not failing due to another bug which masked the problem.
When we try to fix the bug - see #11225 - the dtest starts hitting the
assertion in `check_exists`. Because it's a test, we configure
`on_internal_error` to crash the system.

The point of this commit is to not crash the system in this rare
scenario which happens only in some weird tests. We now throw
`std::runtime_error` instead of calling `on_internal_error`. In the
dtest, we already ignore the resulting CDC error appearing in the logs
(see scylladb/scylla-dtest#2804). Together with this change, we'll be
able to fix the #11225 bug and pass this test.

Closes #11287
2022-08-14 13:12:03 +03:00
Nadav Har'El
329068df99 test/cql-pytest: un-xfail yet another passing collection-indexing test
After collection indexing has been implemented, yet another test which
failed because of #2962 now passes. So remove the "xfail" marker.

Refs #2962

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
f6f18b187a secondary index: fix paging in map value indexing
When indexing a map column's *values*, if the same value appears more
than once, the same row will appear in the index more than once. We had
code that removed these duplicates, but this deduplication did not work
across page boundaries. We had two xfailing tests to demonstrate this bug.

In this patch we fix this bug by looking at the page's start and not
generating the same row again, thereby getting the same deduplication
we had inside pages - now across pages.

The previously-xfailing tests now pass, and their xfail tag is removed.
I also added another test, for the case where the base table has only
partition keys without clustering keys. This second test is important
because the code path for the partition-key-only case is different,
and the second test exposed a bug in it as well (which is also fixed
in this patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
dc445b9a73 test/cql-pytest: test for paging with collection values index
If a map has several keys with the same value, then the "values(m)" index
must remember all of them as matching the same row - because later we may
remove one of these keys from the map but the row would still need to
match the value because of the remaining keys.

We already had a test (test_index_map_values) that although the same row
appears more than once for this value, when we search for this value the
result only returns the row once. Under the hood, Scylla does find the
same value multiple times, but then eliminates the duplicate matched raw
and returns it only once.

But there is a complication, that this de-duplication does not easily
span *paging*. So in this patch we add a test that checks that paging
does not cause the same row to be returned more than once.
Unfortunately, this test currently fails on Scylla so marked "xfail".
It passes on Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
5d556115a1 cql, view: rename and explain bytes_with_action
The structure "bytes_with_action" was very hard to understand because of
its mysterious and general-sounding name, and no comments.

In this patch I add a large comment explaining its purpose, and rename
it to a more suitable name, view_key_and_action, which suggests that
each such object is about one view key (where to add a view row), and
an additional "action" that we need to take beyond adding the view row.

This is the best I can do to make this code easier to understand without
completely reorganizing it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
8b00c91c13 cql, index: make collection indexing a cluster feature
Prevent a user from creating a secondary index on a collection column if
the cluster has any nodes which don't support this feature. Such nodes
will not be able to correctly handle requests related to this index,
so better not allow creating one.

Attempting to create an index on a collection before the entire cluster
supports this feature will result in the error:

    Indexing of collection columns not supported by some older nodes
    in this cluster. Please upgrade them.

Tested by manually disabling this feature in feature_service.cc and
seeing this error message during collection indexing test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
aa86f808a6 test/cql-pytest: failing tests for oversized key values in MV and SI
In issue #9013, we noticed that if a value larger than 64 KB is indexed,
the write fails in a bad way, and we fixed it. But the test we wrote
when fixing that issue already suggested that something was still wrong:
Cassandra failed the write cleanly, with an InvalidRequest, while Scylla
failed with a mysterious WriteFailure (with a relevant error message
only in the log).

This patch adds several xfailing tests which demonstrate what's still
wrong. This is also summarized in issue #8627:

1. A write of an oversized value to an indexed column returns the wrong
   error message.
2. The same problem also exists when indexing a collection, and the indexed
   key or value is oversized.
3. The situation is even less pleasant when adding an index to a table
   with pre-existing data and an oversized value. In this case, the
   view building will fail on the bad row, and never finish.
4. We have exactly the same bugs not just with indexes but also with
   materialized views. Interestingly, Cassandra has similar bugs in
   materialized views as well (but not in the secondary index case,
   where Cassandra does behave as expected).

Refs #8627.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
2c244c6e09 cql: fix secondary index "target" when column name has special characters
Unfortunately, we encode the "target" of a secondary index in one of
three ways:

1. It can be just a column name
2. It can be a string like keys(colname) - for the new type of
   collection indexes introduced in this series.
3. It can be a JSON map ({ ... }). This form is used for local indexes.

The code parsing this target - target_parser::parse() - needs not to
confuse these different formats. Before this patch, if the column name
contains special characters like braces or parentheses (this is allowed
in CQL syntax, via quoting), we can confuse case 1, 2, and 3: A column
named "keys(colname)" will be confused for case 2, and a column named
"{123}" will be confused with case 3.

This problem can break indexing of some specially-crafted column names -
as reproduced by test_secondary_index.py::test_index_quoted_names.

The solution adopted in this patch is that the column name in case 1
should be escaped somehow so it cannot be possibly confused with either
cases 2 and 3. The way we chose is to convert the column name to CQL (with
column_definition::as_cql_name()). In other words, if the column name
contains non-alphanumeric characters, it is wrapped in quotes and also
quotes are doubled, as in CQL. The result of this can't be confused
with case 2 or 3, neither of which may begin with a quote.

This escaping is not the minimal we could have done, but incidentally it
is exactly what Cassandra does as well, so I used it as well.

This change is *mostly* backward compatible: Already-existing indexes will
still have unescaped column names stored for their "target" string,
and the unescaping code will see they are not wrapped in quotes, and
not change them. Backward compatibility will only fail on existing indexes
on columns whose name begin and end in the quote characters - but this
case is extremely unlikely.

This patch illustrates how un-ideal our index "target" encoding is,
but isn't what made it un-ideal. We should not have used three different
formats for the index target - the third representation (JSON) should
have sufficed. However, two two other representations are identical
to Cassandra's, so using them when we can has its compatibility
advantages.

The patch makes test_secondary_index.py::test_index_quoted_names pass.

Fixes #10707.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
56204a3794 cql, index: improve error messages
Before this patch, trying to create an index on entries(x) where x is
not a map results in an error message:

  Cannot create index on index_keys_and_values of column x

The string "index_keys_and_values" is strange - Cassandra prints the
easier to understand string "entries()" - which better corresponds to
what the user actually did.

It turns out that this string "index_keys_and_values" comes from an
elaborate set of variables and functions spanning multiple source files,
used to convert our internal target_type variable into such a string.
But although this code was called "index_option" and sounded very
important, it was actually used just for one thing - error messages!

So in this patch we drop the entire "index_option" abstraction,
replacing it by a static trivial function defined exactly where
it's used (create_index_statement.cc), which prints a target type.
While at it, we print "entries()" instead of "index_keys_and_values" ;-)

After this patch, the
test_secondary_index.py::test_index_collection_wrong_type

finally passes (the previous patch fixed the default table names it
assumes, and this patch fixes the expected error messages), so its
"xfail" tag is removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
84461f1827 cql, index: fix default index name for collection index
When creating an index "CREATE INDEX ON tbl(keys(m))", the default name
of the index should be tbl_m_idx - with just "m". The current code
incorrectly used the default name tbl_m_keys_idx, so this patch adds
a test (which passes on Cassandra, and after this patch also on Scylla)
and fixes the default name.

It turns out that the default index name was based on a mysterious
index_target::as_string(), which printed the target "keys(m)" as
"m_keys" without explaining why it was so. This method was actually
used only in three places, and all of them wanted just the column
name, without the "_keys" suffix! So in this patch we rename the
mysterious as_string() to column_name(), and use this function instead.

Now that the default index name uses column_name() and gets just
column_name(), the correct default index name is generated, and the
test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
94ba03a4d6 test/cql-pytest: un-xfail several collecting indexing tests
After the previous patches implemented collection indexing, several
tests in test/cql-pytest/test_secondary_index.py that were marked
with "xfail" started to pass - so here we remove the xfail.

Only three collection indexing tests continue to xfail:
test_secondary_index.py::test_index_collection_wrong_type
test_secondary_index.py::test_index_quoted_names (#10707)
test_secondary_index.py::test_local_secondary_index_on_collection (#10713)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Michał Radwański
2690ecd65d test/cql-pytest/test_secondary_index: verify that local index on
collection fails.

Collection indexing is being tracked by #2962. Global secondary index
over collection is enabled by #10123. Leave this test to track this
behaviour.

Related issue: #10713
2022-08-14 10:29:52 +03:00
Michał Radwański
1d852a9c7f docs/design-notes/secondary_index: add VALUES to index target list
A new secondary index target is being supported, which is `VALUES(v)`.
2022-08-14 10:29:52 +03:00
Michał Radwański
25f4c905f5 test/cql-pytest/test_secondary_index: add randomized test for indexes on
collections
2022-08-14 10:29:52 +03:00
Michał Radwański
2a8289c101 cql-pytest/cassandra_tests/.../secondary_index_test: fix error message
in test ported from Cassandra
2022-08-14 10:29:52 +03:00
Michał Radwański
fb476702a7 cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test
ported from Cassandra is expected to fail, since Scylla assumes that
comparison with null doesn't throw error, just evaluates to false. Since
it's not a bug, but expected behavior from the perspective of Scylla, we
don't mark it as xfail.
2022-08-14 10:29:52 +03:00
Michał Radwański
f572051ee9 test/boost/secondary_index_test: update for non-frozen indexes on
collections
2022-08-14 10:29:52 +03:00
Karol Baryła
9e377b2824 test/cql-pytest: Uncomment collection indexes tests that should be working now 2022-08-14 10:29:52 +03:00
Nadav Har'El
67990d2170 cql, index: don't use IS NOT NULL on collection column
When the secondary-index code builds a materialized view on column x, it
adds "x IS NOT NULL" to the where-clause of the view, as required.

However, when we index a collection column, we index individual pieces
of the collection (keys, values), the the entire collection, so checking
if the entire collection is null does not make sense. Moreover, for a
collection column x, "x IS NOT NULL" currently doesn't work and throws
errors when evaluating that expression when data is written to the table.

The solution used in this patch is to simply avoid adding the "x IS NOT
NULL" when creating the materialized view for a collection index.
Everything works just fine without it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Michał Radwański
bd44bc3e35 cql3/statements/select_statement: for index on values of collection,
don't emit duplicate rows

The index on collection values is special in a way, as its' clustering key contains not only the
base primary key, but also a column that holds the keys of the cells in the collection, which
allows to distinguish cells with different keys but the same value.
This has an unwanted consequence, that it's possible to receive two identical base table primary
keys from indexed_table_select_statement::find_index_clustering_rows. Thankfully, the duplicate
primary keys are guaranteed to occur consequently.
2022-08-14 10:29:52 +03:00
Michał Radwański
10e241988e cql/expr/expression, index/secondary_index_manager: needs_filtering and
index_supports_expression rewrite to accomodate for indexes over
collections
2022-08-14 10:29:52 +03:00
Karol Baryła
ac97086855 cql3, index: Use entries() indexes on collections for queries
Previous commit added the ability to use GSI over non-frozen collections in queries,
but only the keys() and values() indexes. This commit adds support for the missing
index type - entries() index.

Signed-off-by: Karol Baryła <karol.baryla@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Karol Baryła
7966841d37 cql3, index: Use keys() and values() indexes on collections for queries.
Previous commits added the possibility of creating GSI on non-frozen collections.
This (and next) commit allow those indexes to actually be used by queries.
This commit enables both keys() and values() indexes, as they are pretty similar.
2022-08-14 10:29:52 +03:00
Karol Baryła
aa47f4a15c types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
std::begin in concept for build_value_fragmented's parameter allows creating it from an array
2022-08-14 10:29:52 +03:00
Michał Radwański
e6521ff8ba cql3/statements/index_target: throw exception to signalize that we
didn't miss returning from function

GCC doesn't consider switches over enums to be exhaustive. Replace
bogous return value after a switch where each of the cases return, with
an exception.
2022-08-14 10:29:52 +03:00
Michał Radwański
32289d681f db/view/view.cc: compute view_updates for views over collections
For collection indexes, logic of computing values for each of the column
needed to change, since a single particular column might produce more
than one value as a result.

The liveness info from individual cells of the collection impacts the
liveness info of resulting rows. Therefore it is needed to rewrite the
control flow - instead of functions getting a row from get_view_row and
later computing row markers and applying it, they compute these values
by themselves.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:49 +03:00
Michał Radwański
112086767c view info: has_computed_column_depending_on_base_non_primary_key
In case of secondary indexes, if an index does not contain any column from
the base which makes up for the primary key, then it is assumed that
during update, a change to some cells from the base table cannot cause
that we're dealing with a different row in the view. This however
doesn't take into account the possibility of computed columns which in
fact do depend on some non-primary-key columns. Introduce additional
property of an index,
has_computed_column_depending_on_base_non_primary_key.
2022-08-14 10:29:14 +03:00
Michał Radwański
4cfd264e5d column_computation: depends_on_non_primary_key_column
depends_on_non_primary_key_column for a column computation is needed to
detect a case where the primary key of a materialized view depends on a
non primary key column from the base table, but at the same time, the view
itself doesn't have non-primary key columns. This is an issue, since as
for now, it was assumed that no non-primary key columns in view schema
meant that the update cannot change the primary key of the view, and
therefore the update path can be simplified.
2022-08-14 10:29:14 +03:00
Michał Radwański
f1a9def2e1 schema, index/secondary_index_manager: make schema for index-induced mv
Indexes over collections use materialized views. Supposing that we're
dealing with global indexes, and that pk, ck were the partition and
clustering keys of the base table, the schema of the materialized view,
apart from having idx_token (which is used to preserve the order on the
entries in the view), has a computed column coll_value (the name is not
guaranteed to be exactly) and potentially also
coll_keys_for_values_index, if the index was over collection values.
This is needed, since values in a specific collection need not be
unique.

To summarize, the primary key is as follows:

coll_value, idx_token, pk, ck, coll_keys_for_values_index?

where coll_value is the computed value from the collection, be it a key
from the collection, a value from the collection, or the tuple containing
both.
2022-08-14 10:29:14 +03:00
Michał Radwański
60d50f6016 index/secondary_index_manager: extract keys, values, entries types from collection
These functions are relevant for indexes over collections (creating
schema for a materialized view related to the index).

Signed-off-by: Michał Radwański <michal.radwanski@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:14 +03:00
Michał Radwański
cbe33f8d7a cql3/statements/: validate CREATE INDEX for index over a collection
Allow CQL like this:
CREATE INDEX idx ON table(some_map);
CREATE INDEX idx ON table(KEYS(some_map));
CREATE INDEX idx ON table(VALUES(some_map));
CREATE INDEX idx ON table(ENTRIES(some_map));
CREATE INDEX idx ON table(some_set);
CREATE INDEX idx ON table(VALUES(some_set));
CREATE INDEX idx ON table(some_list);
CREATE INDEX idx ON table(VALUES(some_list));

This is needed to support creating indexes on collections.
2022-08-14 10:29:13 +03:00
Michał Radwański
997682ed72 cql3/statements/create_index_statement,index_target: rewrite index target for collection
The syntax used for creating indexes on collections that is present in
Cassandra is unintuitive from the internal representation point of view.

For instance, index on VALUES(some_set) indexes the set elements, which
in the internal representation are keys of collection. Rewrite the index
target after receiving it, so that the index targets are consistent with
the representation.
2022-08-14 10:29:13 +03:00
Michał Radwański
ebc4ad4713 column_computation.hh, schema.cc: collection_column_computation
This type of column computation will be used for creating updates to
materialized views that are indexes over collections.

This type features additional function, compute_values_with_action,
which depending on an (optional) old row and new row (the update to the
base table) returns multiple bytes_with_action, a vector of pairs
(computed value, some action), where the action signifies whether a
deletion of row with a specific key is needed, or creation thereby.
2022-08-14 10:29:13 +03:00
Michał Radwański
2babee2cdc column_computation.hh, schema.cc: compute_value interface refactor
The compute_value function of column_computation has had previously the
following signature:
    virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;

This is superfluous, since never in the history of Scylla, the last
parameter (row) was used in any implentation, and never did it happen
that it returned bytes_opt. The absurdity of this interface can be seen
especially when looking at call sites like following, where dummy empty
row was created:
```
    token_column.get_computation().compute_value(
            *_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty()));
```
2022-08-14 10:29:13 +03:00
Michał Radwański
166afd46b5 Cql.g, treewide: support cql syntax INDEX ON table(VALUES(collection))
Brings support of cql syntax `INDEX ON table(VALUES(collection))`, even
though there is still no support for indexes over collections.
Previously, index_target::target_type::values was refering to values of
a regular (non-collection) column. Rename it to `regular_values`.

Fixes #8745.
2022-08-14 10:29:13 +03:00
Piotr Sarna
fe617ed198 Merge 'db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column' from Piotr Dulikowski
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:

- The `broadcast_rpc_address` is the address that the drivers are
  supposed to connect to. `rpc_address` is the address that the node
  binds to - it can be set for example to 0.0.0.0 so that Scylla listens
  on all addresses, however this gives no useful information to the
  driver.
- The `system.peers` table also has the `rpc_address` column and it
  already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.

Fixes: #11201

Closes #11204

* github.com:scylladb/scylladb:
  db/system_keyspace: fix indentation after previous patch
  db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
2022-08-12 16:24:28 +02:00
Anna Stuchlik
41362829b5 doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information 2022-08-12 14:39:10 +02:00
Anna Stuchlik
b45ba69a6c doc: update the guides for Ubuntu and Debian to remove image information and the OS version number 2022-08-12 14:05:49 +02:00
Anna Stuchlik
24acffc2ce doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1 2022-08-12 13:47:03 +02:00
Piotr Sarna
1ab4c6aab3 Merge 'cql3: enable collections as UDA accumulators' from Wojciech Mitros
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values.

Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used to converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #11250

* github.com:scylladb/scylladb:
  cql3: enable collections as UDA accumulators
  cql3: extend implementation of to_bytes for raw_value
2022-08-12 12:51:17 +02:00
Botond Dénes
ceb1cdcb7a Merge 'doc: fix the typo on the Fault Tolerance page' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/438
In addition, I've replaced "Scylla" with "ScyllaDB" on that page.

Closes #11281

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB on the Fault Tolerance page
  doc: fis the typo in the note
2022-08-12 06:58:39 +03:00
Nadav Har'El
c27f431580 test/alternator: fix a flaky test for full-table scan page size
This patch fixes the test test_scan.py::test_scan_paging_missing_limit
which failed in a Jenkins run once (that we know of).

That test verifies that an Alternator Scan operation *without* an explicit
"Limit" is nevertheless paged: DynamoDB (and also Scylla) wanted this page
size to be 1 MB, but it turns out (see #10327) that because of the details
of how Scylla's scan works, the page size can be larger than 1 MB. How much
larger? I ran this test hundreds of times and never saw it exceed a 3 MB
page - so the test asserted the page must be smaller than 4 MB. But now
in one run - we got to this 4 MB and failed the test.

So in this patch we increase the table to be scanned from 4 MB to 6 MB,
and assert the page size isn't the full 6 MB. The chance that this size will
eventually fail as well should be (famous last words...) very small for
two reasons: First because 6 MB is even higher than I the maximum I saw
in practice, and second because empirically I noticed that adding more
data to the table reduces the variance of the page size, so it should
become closer to 1 MB and reduce the chance of it reaching 6 MB.

Refs #10327

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11280
2022-08-12 06:57:45 +03:00
Botond Dénes
2a39d6518d Merge 'doc: clarify the disclaimer about reusing deleted counter column values' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/857

Closes #11253

* github.com:scylladb/scylladb:
  doc: language improvemens to the Counrers page
  doc: fix the external link
  doc: clarify the disclaimer about reusing deleted counter column values
2022-08-12 06:56:28 +03:00
Botond Dénes
10371441c9 Merge 'docs: add a disclaimer about not supporting local counters by SSTableLoader' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/867
Plus some language, formatting, and organization improvements.

Closes #11248

* github.com:scylladb/scylladb:
  doc: language, formatting, and organization improvements
  doc: add a disclaimer about not supporting local counters by SSTableLoader
2022-08-12 06:55:00 +03:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Botond Dénes
69aea59d97 Merge 'storage_proxy: use consistent topology, prepare for fencing' from Avi Kivity
Replication is a mix of several inputs: tokens and token->node mappings (topology),
the replication strategy, replication strategy parameters. These are all captured
in effective_replication_map.

However, if we use effective_replication_map:s captured at different times in a single
query, then different uses may see different inputs to effective_replication_map.

This series protects against that by capturing an effective_replication_map just
once in a query, and then using it. Furthermore, the captured effective_replication_map
is held until the query completes, so topology code can know when a topology is no
longer is use (although this isn't exploited in this series).

Only the simple read and write paths are covered. Counters and paxos are left for
later.

I don't think the series fixes any bugs - as far as I could tell everything was happening
in the same continuation. But this series ensures it.

Closes #11259

* github.com:scylladb/scylladb:
  storage_proxy: use consistent topology
  storage_proxy: use consistent replication map on read path
  storage_proxy: use consistent replication map on write path
  storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
  consistency_level: accept effective_replication_map as parameter, rather than keyspace
  consistency_level: be more const when using replication_strategy
2022-08-12 06:00:30 +03:00
Alejo Sanchez
10baac1c84 test.py: test topology and schema changes
Add support for topology changes: add/stop/remove/restart/replace node.

Test simple schema changes when changing topology.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
7f32fc0cc7 test.py: ClusterManager API mark cluster dirty
Allow tests to manually mark current cluster dirty.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
a585a82ad1 test.py: call before/after_test for each test case
Preparing for topology tests with changing clusters, run before and
after checks per test case.

Change scope of pytest fixtures to function as we need them per test
casse.

Add server and client API logic.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
eedc866433 test.py: handle driver connection in ManagerClient
Preparing for cluster cycling, handle driver connection in ManagerClient.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
fe561a7dbd test.py: ClusterManager API and ManagerClient
Add an API via Unix socket to Manager so pytests can query information
about the cluster. Requests are managed by ManagerClient helper class.

The socket is placed inside a unique temporary directory for the
Manager (as safe temporary socket filename is not possible in Python).

Initial API services are manager up, cluster up, if cluster is dirty,
cql port, configured replicas (RF), and list of host ids.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
aad015d4e2 test.py: improve topology docstring
Improve docstring of TopologyTestSuite to reflect its differences with
other test suites.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Avi Kivity
a2c4f5aa1a storage_proxy: use consistent topology
Derive the topology from captured and stable effective_replication_map
instead of getting a fresh topology from storage_proxy, since the
fresh topology may be inconsistent with the running query.

digest_read_resolver did not capture an effective_replication_map, so
that is added.
2022-08-11 17:58:42 +03:00
Avi Kivity
883518697b storage_proxy: use consistent replication map on read path
Capture a replication map just once in
abstract_read_executor::_effective_replication_map_ptr. Although it isn't
used yet, it serves to keep a reference count on topology (for fencing),
and some accesses to topology within reads still remain, which can be
converted to use the member in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
01a614fb4d storage_proxy: use consistent replication map on write path
Capture a replication map just once in
abstract_write_handler::_effective_replication_map_ptr and use it
in all write handlers. A few accesses to get the topology still remain,
they will be fixed up in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
f1b0e3d58e storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
Allow callers to use consistent effective_replication_map:s across calls
by letting the caller select the object to use.
2022-08-11 17:58:42 +03:00
Avi Kivity
46bd0b1e62 consistency_level: accept effective_replication_map as parameter, rather than keyspace
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).

Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.

Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
2022-08-11 17:58:42 +03:00
Avi Kivity
1078d1bfda consistency_level: be more const when using replication_strategy
We don't modify the replication_strategy here, so use const. This
will help when the object we get is const itself, as it will be in
the next patches.
2022-08-11 17:58:42 +03:00
Wojciech Mitros
48bd752971 cql3: enable collections as UDA accumulators
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values. For example, a list with string elements:
'a, b', 'c' would be represented as a string 'a, b, c', while now
it is represented as "['a, b', 'c']".
Some types that were parsable are now represented in a different
way. For example, a tuple ('a', null, 0) was represented as
"a:\@:0", and now it is "('a', null, 0)".

Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used to converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-08-11 16:23:57 +02:00
Anna Stuchlik
f5a49688ae doc: replace Scylla with ScyllaDB on the Fault Tolerance page 2022-08-11 16:14:33 +02:00
Anna Stuchlik
7218a977df doc: fis the typo in the note 2022-08-11 16:09:49 +02:00
Botond Dénes
d407d3b480 Merge 'Calculate effective_replication_map: prevent stalls with everywhere_replication_strategy' from Benny Halevy
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
one once and reuse the result for all token.

Note that ideally the replication_map could contain only
a single token range for this case, but that does't seem to work yet.

Add `maybe_yield()` calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Closes #11277

* github.com:scylladb/scylladb:
  abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
  abstract_replication_strategy: add has_uniform_natural_endpoints
2022-08-11 15:19:47 +03:00
Gleb Natapov
e5157b27ad raft: getting abort_requested_exception exception from a sm::apply is not a critical error
During shutdown it is normal to get abort_requested_exception exception
from a state machine "apply" method. Do not rethrow it as
state_machine_error, just abort an applier loop with an info message.
2022-08-11 15:11:21 +03:00
Gleb Natapov
9977851eb1 schema_registry: fix abandoned feature warning
maybe_sync ignores failed feature in case waiting is aborted. Fix it.
2022-08-11 15:11:21 +03:00
Gleb Natapov
eed8e19813 service: raft: silence rpc::closed_errors in raft_rpc
Before the patch if an RPC connection was established already then the
close error was reported by the RPC layer and then duplicated by
raft_rpc layer. If a connection cannot be established because the remote
node is already dead RPC does not report the error since we decided that
in that case gossiper and failure detector messages can be used to
detect the dead node case and there is no reason to pollute the logs
with recurring errors. This aligns raft behaviour with what we already
have in storage_proxy that does not report closed errors as well.
2022-08-11 15:11:21 +03:00
Anna Stuchlik
1603129275 doc: remove the reduntant space from index 2022-08-11 12:36:16 +02:00
Anna Stuchlik
ee258cb0af doc: update the syntax for defining service level attributes 2022-08-11 12:32:38 +02:00
Petr Gusev
4bc6611829 raft read_barrier, retry over intermittent rpc failures
If the leader was unavailable during read_barrier,
closed_error occurs, which was not handled in any way
and eventually reached the client. This patch adds retries in this case.

Fix: scylladb#11262
Refs: #11278

Closes #11263
2022-08-11 13:31:19 +03:00
Amnon Heiman
5ac20ac861 Reduce the number of per-scheduling group metrics
This patch reduces the number of metrics ScyllaDB generates.

Motivation: The combination of per-shard with per-scheduling group
generates a lot of metrics. When combined with histograms, which require
many metrics, the problem becomes even bigger.

The two tools we are going to use:
1. Replace per-shard histograms with summaries
2. Do not report unused metrics.

The storage_proxy stats holds information for the API and the metrics
layer.  We replaced timed_rate_moving_average_and_histogram and
time_estimated_histogram with the unfied
timed_rate_moving_average_summary_and_histogram which give us an option
to report per-shard summaries instead of histogram.

All the counters, histograms, and summaries were marked as
skip_when_empty.

The API was modified to use
timed_rate_moving_average_summary_and_histogram.

Closes #11173
2022-08-11 13:31:19 +03:00
Benny Halevy
9167b857e9 abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
one once and reuse the result for all token.

Note that ideally the replication_map could contain only
a single token range for this case, but that does't seem to work yet.

Add maybe_yield() calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:35:29 +03:00
Benny Halevy
eb678e723b abstract_replication_strategy: add has_uniform_natural_endpoints
So that using calaculate_natural_endpoints can be optimized
for strategies that return the same endpoints for all tokens,
namely everywhere_replication_strategy and local_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:34:14 +03:00
Calle Wilund
a729c2438e commitlog: Make get_segments_to_replay on-demand
Refs #11237

Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
2022-08-11 06:41:23 +00:00
Nadav Har'El
d03bd82222 Revert "test: move scylla_inject_error from alternator/ to cql-pytest/"
This reverts commit 8e892426e2 and fixes
the code in a different way:

That commit moved the scylla_inject_error function from
test/alternator/util.py to test/cql-pytest/util.py and renamed
test/alternator/util.py. I found the rename confusing and unnecessary.
Moreover, the moved function isn't even usable today by the test suite
that includes it, cql-pytest, because it lacks the "rest_api" fixture :-)
so test/cql-pytest/util.py wasn't the right place for it anyway.
test/rest_api/rest_util.py could have been a good place for this function,
but there is another complication: Although the Alternator and rest_api
tests both had a "rest_api" fixture, it has a different type, which led
to the code in rest_api which used the moved function to have to jump
through hoops to call it instead of just passing "rest_api".

I think the best solution is to revert the above commit, and duplicate
the short scylla_inject_error() function. The duplication isn't an
exact copy - the test/rest_api/rest_util.py version now accepts the
"rest_api" fixture instead of the URL that the Alternator version used.

In the future we can remove some of this duplication by having some
shared "library" code but we should do it carefully and starting with
agreeing on the basic fixtures like "rest_api" and "cql", without that
it's not useful to share small functions that operate on them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11275
2022-08-11 06:43:26 +03:00
Wojciech Mitros
42e0fb90ea cql3: extend implementation of to_bytes for raw_value
When called with a null_value or an unset_value,
raw_value::to_bytes() threw an std::get error for
wrong variant. This patch adds a description for
the errors thrown, and adds a to_bytes_opt() method
that instead of throwing returns a std::nullopt.
2022-08-10 16:40:22 +02:00
Avi Kivity
e9cbc9ee85 Merge 'Add support for empty replica pages' from Botond Dénes
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.

Upgrade sanity test was conducted as following:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.

Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op,  12.0 tasks/op,   43007 insns/op,        0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op,  12.0 tasks/op,   43181 insns/op,        0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instruction/op.

Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672

Closes #11053

* github.com:scylladb/scylladb:
  test/cql-pytest: add test for query tombstone page limit
  query-result-writer: stop when tombstone-limit is reached
  service/pager: prepare for empty pages
  service/storage_proxy: set smallest continue pos as query's continue pos
  service/storage_proxy: propagate last position on digest reads
  query: result_merger::get() don't reset last-pos on short-reads and last pages
  query: add tombstone-limit to read-command
  service/storage_proxy: add get_tombstone_limit()
  query: add tombstone_limit type
  db/config: add config item for query tombstone limit
  gms: add cluster feature for empty replica pages
  tree: don't use query::read_command's IDL constructor
2022-08-10 13:38:06 +03:00
Avi Kivity
32b405d639 Merge 'doc: change the default for the overprovisioned option' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/842

This PR changes the default for the `overprovisioned` option from `disabled` to `enabled`, according to https://github.com/scylladb/scylla-doc-issues/issues/842.
In addition, I've used this opportunity to replace "Scylla" with "ScyllaDB" on the updated page.

Closes #11256

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB in the product name
  doc: change the default for the overprovisioned option
2022-08-10 12:43:44 +03:00
Raphael S. Carvalho
ace6334619 replica: table: kill unused _sstables_staging
Good change as it's one less thing to worry about in compaction
group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-08-10 12:32:13 +03:00
Kamil Braun
cff595211e Merge 'Raft test topology part 2' from Alecco
Give cluster control to pytests. While there add missing stop gracefully and add server to ScyllaCluster.
Clusters can be marked dirty but they are not recycled yet. This will be done in a later series.

Closes #11219

* github.com:scylladb/scylladb:
  test.py: ScyllaCluster add_server() mark dirty
  test.py: ScyllaCluster add server management
  test.py: improve seeds for new servers
  test.py: Topology tests and Manager for Scylla clusters
  test.py: rename scylla_server to scylla_cluster
  test.py: function for python driver connection
  test.py: ScyllaCluster add_server helper
  test.py: shutdown control connection during graceful shutdown
  test.py: configurable authenticator and authorizer
  test.py: ScyllaServer stop gracefully
  test.py: FIXME for bad cluster log handling logic
2022-08-10 11:13:21 +02:00
Michał Chojnowski
de0f2c21ec configure.py: make messaging_service.cc the first source file
Currently messaging_service.o takes the longest of all core objects to
compile.  For a full build of build/release/scylla, with current ninja
scheduling, on a 32-hyperthread machine, the last ~16% of the total
build time is spent just waiting on messaging_service.o to finish
compiling.

Moving the file to the top of the list makes ninja start its compilation
early and gets rid of that single-threaded tail, improving the total build
time.

Closes #11255
2022-08-10 11:18:09 +03:00
Calle Wilund
8116c56807 commitlog: Revert/modify fac2bc4 - do footprint add in delete
Fixes #11184
Fixes #11237

In prev (broken) fix for #11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.
This effectively prevents us from creating segments iff we have tight
limits. Since we nowadays do quite a bit of inserts _before_ commitlog
replay (system.local, but...) we can end up in a situation where we
deadlock start because we cannot get to the actual replay that will
eventually free things.
Another, not thought through, consequence is that we add a single
footprint to _all_ commitlog shard instances - even though only
shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.

Simplest fix is to add the footprint in delete call instead. This will
lock out segment creation until delete call is done, but this is fast.
Also ensures that only replay shard is involved.
2022-08-10 08:04:03 +00:00
Botond Dénes
e27127bb7f test/cql-pytest: add test for query tombstone page limit
Check that the replica returns empty pages as expected, when a large
tombstone prefix/span is present. Large = larger than the configured
query_tombstone_limit (using a tiny value of 10 in the test to avoid
having to write many tombstones).
2022-08-10 09:14:59 +03:00
Tomasz Grabiec
8ee5b69f80 test: row_cache: Use more narrow key range to stress overlapping reads more
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.

Closes #11260
2022-08-10 06:53:54 +03:00
Botond Dénes
7730419f5c query-result-writer: stop when tombstone-limit is reached
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large span of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
2022-08-10 06:03:38 +03:00
Botond Dénes
8066dbc635 service/pager: prepare for empty pages
The pager currently assumes that an empty pages means the query is
exhausted. Lift this assumption, as we will soon have empty short
pages.
Also, paging using filtering also needs to use the replica-provided
last-position when the page is empty.
2022-08-10 06:03:38 +03:00
Botond Dénes
6a7dedfe34 service/storage_proxy: set smallest continue pos as query's continue pos
We expect each replica to stop at exactly the same position when the
digests match. Soon however, if replicas have a lot of tombstones, some
may stop earlier then the others. As long as all digests match, this is
fine but we need to make sure we continue from the smallest such
positions on the next page.
2022-08-10 06:03:38 +03:00
Botond Dénes
2656968db2 service/storage_proxy: propagate last position on digest reads
We want to transmit the last position as determined by the replica on
both result and digest reads. Result reads already do that via the
query::result, but digest reads don't yet as they don't return the full
query::result structure, just the digest field from it. Add the last
position to the digest read's return value and collect these in the
digest resolver, along with the returned digests.
2022-08-10 06:03:37 +03:00
Botond Dénes
8c0dd99f7c query: result_merger::get() don't reset last-pos on short-reads and last pages
When merging multiple query-results, we use the last-position of the
last result in the combined one as the combined result's last position.
This only works however if said last result was included fully.
Otherwise we have to discard the last-position included with the result
and the pager will use the position of the last row in the combined
result as the last position.
The commit introducing the above logic mistakenly discarded the last
position when the result is a short read or a page is not full. This is
not necessary and even harmful as it can result in an empty combined
result being delivered to the pager, without a last-position.
2022-08-10 06:01:49 +03:00
Botond Dénes
d1d53f1b84 query: add tombstone-limit to read-command
Propagate the tombstone-limit from coordinator to replicas, to make sure
all is using the same limit.
2022-08-10 06:01:47 +03:00
Anna Stuchlik
43cc17bf5d doc: replace Scylla with ScyllaDB in the product name 2022-08-09 16:19:55 +02:00
Anna Stuchlik
d21b92fb13 doc: change the default for the overprovisioned option 2022-08-09 16:09:29 +02:00
Anna Stuchlik
c3dbb9706e doc: language improvemens to the Counrers page 2022-08-09 14:35:44 +02:00
Alejo Sanchez
05afca2199 test.py: ScyllaCluster add_server() mark dirty
When changing topology, tests will add servers. Make add_server mark the
cluster dirty. But mark the cluster as not dirty after calling
add_server when installing the cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
f1a6e4bda9 test.py: ScyllaCluster add server management
Preparing for topology changes, implement the primitives for managing
ScyllaServers in ScyllaCluster. The states are started, stopped, and
removed. Started servers can be stopped or restarted. Stopped servers
can be started. Stopped servers can be removed (destroyed).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
a6448458bb test.py: improve seeds for new servers
Instead of only using last started server as seed, use all started
servers as seed for new servers.

This also avoids tracking last server's state.

Pass empty list instead of None.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
83dab6045b test.py: Topology tests and Manager for Scylla clusters
Preparing to cycle clusters modified (dirty) and use multiple clusters
per topology pytest, introduce Topology tests and Manager class to
handle clusters.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
14328d1e42 test.py: rename scylla_server to scylla_cluster
This file's most important class is ScyllaCluster, so rename it
accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
dcd8d77f34 test.py: function for python driver connection
Isolate python driver connection on its own function. Preparing for
harness client fixture to handle the connection.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
1db31ebfdc test.py: ScyllaCluster add_server helper
For future use from API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Konstantin Osipov
c81c8af1ba test.py: shutdown control connection during graceful shutdown 2022-08-09 14:26:13 +02:00
Alejo Sanchez
bc494754e8 test.py: configurable authenticator and authorizer
For scylla servers, keep default PasswordAuthenticator and
CassandraAuthorizer but allow this to be configurable per test suite.
Use AllowAll* for topology test suite.

Disabling authentication avoids complications later for topology tests
as system_auth kespace starts with RF=1 and tests take down nodes. The
keyspace would need to change RF and run repair. Using AllowAll avoids
this problem altogether.

A different cql fixture is created without auth for topology tests.

Topology tests require servers without auth from scylla.yaml conf.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
6437a7f467 test.py: ScyllaServer stop gracefully
Add stop_gracefully() method. Terminates a server in a clean way.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
573ed429ad test.py: FIXME for bad cluster log handling logic
The code in test.py using a ScyllaCluster is getting a server id and
taking logs from only the first server.

If there is a failure in another server it's not reported properly.

And CQL connection will go only to the first server.

Also, it might be better to have ScyllaCluster to handle these matters
and be more opaque.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Anna Stuchlik
15c24ba3e0 doc: fix the external link 2022-08-09 14:20:54 +02:00
Anna Stuchlik
82d1f67378 doc: clarify the disclaimer about reusing deleted counter column values 2022-08-09 14:12:37 +02:00
Avi Kivity
be44fd63f9 Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.

To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold a effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).

Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.

Refs #11005

Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping)

Closes #11009

* github.com:scylladb/scylladb:
  effective_replication_map: make get_range_addresses asynchronous
  range_streamer: add_ranges and friends: get erm as param
  storage_service: get_new_source_ranges: get erm as param
  storage_service: get_changed_ranges_for_leaving: get erm as param
  storage_service: get_ranges_for_endpoint: get erm as param
  repair: use get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces
  storage_service: coroutinize update_pending_ranges
  effective_replication_map: add get_replication_strategy
  effective_replication_map: get_range_addresses: use the precalculated replication_map
  abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
  abstract_replication_strategy: reindent
  utils: sequenced_set: expose set and `contains` method
  abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
  utils: sequenced_set: templatize VectorType
  utils: sanitize sequenced_set
  utils: sequenced_set: delete mutable get_vector method
2022-08-09 13:25:53 +03:00
Benny Halevy
f01a526887 docs: debugging: mention use of release number on backtrace.scylladb.com
Following scylladb/scylla_s3_reloc_server@af17e4ffcd
(scylladb/scylla_s3_reloc_server#28), the release number can
be used to search the relcatable package and/or decode a respective
backtrace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11247
2022-08-09 12:49:59 +03:00
Avi Kivity
d4c986e4fa Merge 'doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 20.04' from Anna Stuchlik
Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1.

Closes #11227

* github.com:scylladb/scylladb:
  doc: add the redirects from Ubuntu version specific to version generic pages
  doc: remove version-speific content for Ubuntu and add the generic page to the toctree
  doc: rename the file to include Ubuntu
  doc: remove the version number from the document and add the link to Supported Versions
  doc: add a generic page for Ubuntu
  doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1
2022-08-09 12:49:16 +03:00
Asias He
12ab2c3d8d storage_service: Prevent removed node to restart and join the cluster
1) Start node1,2,3
2) Stop node3
3) Run nodetool removenode $host_id_of_node3
4) Restart node3

Step 4 is wrong and not allowed. If it happens it will bring back node3
to the cluster.

This patch adds a check during node restart to detect such operation
error and reject the restart.

With this patch, we would see the following in step 4.

```
init - Startup failed: std::runtime_error (The node 127.0.0.3 with
host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the
cluster. Can not restart the removed node to join the cluster again!)
```

Refs #11217

Closes #11244
2022-08-09 12:46:21 +03:00
Avi Kivity
1d4bf115e2 Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239

Closes #11240

* github.com:scylladb/scylladb:
  test: mvcc: Fix illegal use of maybe_refresh()
  tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
  tests: row_cache_test: Introduce one_shot mode to throttle
  row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
2022-08-09 12:39:10 +03:00
Anna Stuchlik
e753b4e793 doc: language, formatting, and organization improvements 2022-08-09 10:34:22 +02:00
Tomasz Grabiec
f59d2d9bf8 range_tombstone_list: Avoid amortized_reserve()
We can use std::in_place_type<> to avoid constructing op before
calling emplace_back(). As a reuslt, we can avoid reserving space. The
reserving was there to avoid the need to roll-back in case
emplace_back() throws.

Kudos to Kamil for suggesting this.

Closes #11238
2022-08-09 11:34:16 +03:00
Avi Kivity
8d37370a71 Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj"
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.

The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.

Closes #11243
2022-08-09 11:23:29 +03:00
Anna Stuchlik
61d33cb2a8 doc: add a disclaimer about not supporting local counters by SSTableLoader 2022-08-09 10:00:14 +02:00
Anna Stuchlik
4be88e1a79 doc: add the redirects from Ubuntu version specific to version generic pages 2022-08-09 09:43:28 +02:00
Raphael S. Carvalho
337390d374 forward_service: execute_on_this_shard: avoid reallocation and copy
avoid about log2(256)=8 reallocations when pushing partition ranges to
be fetched. additionally, also avoid copying range into ranges
container. current_range will not contain the last range, after
moved, but will still be engaged by the end of the loop, allowing
next iteration to happen as expected.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11242
2022-08-09 09:08:53 +02:00
Botond Dénes
1b669cefed service/storage_proxy: add get_tombstone_limit()
To be used by coordinator side code to determine the correct tombstone
limit to pass to read-command (tombstone limit field added in the next
commit). When this limit is non-zero, the replica will start cutting
pages after the tombstone limit is surpassed.
This getter works similarly to `get_max_result_size()`: if the cluster
feature for empty replica pages is set, it will return the value
configured via db::config::query_tombstone_limit. System queries always
use a limit of 0 (unlimited tombstones).
2022-08-09 10:00:40 +03:00
Botond Dénes
8cd2ef7a42 query: add tombstone_limit type
Will be used in read_command. Add it before it is added to read-command
so we can use the unlimited constant in code added in preparation to
that.
2022-08-09 10:00:40 +03:00
Botond Dénes
33f0447ba0 db/config: add config item for query tombstone limit
This will be the value used to break pages, after processing the
specified amount of tombstones. The page will be cut even if empty.
We could maybe use the already existing tombstone_{warn,fail}_threshold
instead and use them as a soft/hard limit pair, like we did with page
sizes.
2022-08-09 10:00:40 +03:00
Botond Dénes
1bc14b5e3b gms: add cluster feature for empty replica pages
So we can start using them only when the entire cluster supports it.
2022-08-09 10:00:40 +03:00
Botond Dénes
60a0e3d88b tree: don't use query::read_command's IDL constructor
It is not type safe: has multiple limits passed to it as raw ints, as
well as other types that ints implicitly convert to. Furthermore the row
limit is passed in two separate fields (lower 32 bits and upper 32
bits). All this make this constructor a minefield for humans to use. We
have a safer constructor for some time but some users of the old one
remain. Move them to the safe one.
2022-08-09 10:00:37 +03:00
Tomasz Grabiec
05b0a62132 test: mvcc: Fix illegal use of maybe_refresh()
maybe_refresh() can only be called if the cursor is pointing at a row.
The code was calling it before the cursor was advanced, and was
thus relying on implementation detail.
2022-08-09 02:28:56 +02:00
Tomasz Grabiec
ce624048d9 tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
Reproducer for #11239.
2022-08-09 02:28:56 +02:00
Tomasz Grabiec
6aaa6f8744 tests: row_cache_test: Introduce one_shot mode to throttle 2022-08-09 02:28:56 +02:00
Tomasz Grabiec
a6a61eaf96 row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239
2022-08-09 02:28:56 +02:00
Takuya ASADA
3ffc978166 main: move preinit_description to main()
We don't need to wait for handling version options after scylla_main()
called, we can handle it in main() instead.

Closes #11221
2022-08-08 18:31:43 +03:00
Benny Halevy
91ab8ee1c3 effective_replication_map: make get_range_addresses asynchronous
So it may yield, preenting reactor stalls as seen in #11005.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
9b2af3f542 range_streamer: add_ranges and friends: get erm as param
Rather than getting it in the callee, let the caller
(e.g.  storage_service)
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
194b9af8d6 storage_service: get_new_source_ranges: get erm as param
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
b50c79eab3 storage_service: get_changed_ranges_for_leaving: get erm as param
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
a5d7ade237 storage_service: get_ranges_for_endpoint: get erm as param
Let its caller Pass the effective_replication_map ptr
so we can get it at the top level and keep it alive
(and coherent) through multiple asynchronous calls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
cffe00cc58 repair: use get_non_local_strategy_keyspaces_erms
Use get_non_local_strategy_keyspaces_erms for getting
a coherent set of keyspace names and their respective
effective replication strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
db5c5ca59e database: add get_non_local_strategy_keyspaces_erms
To be used for getting a coheret set of all keyspaces
with non-local replication strategy and their respective
effective_replication_map.

As an example, use it in this patch in
storage_service::update_pending_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
7ee6048255 database: add get_non_local_strategy_keyspaces
For node operations, we currently call get_non_system_keyspaces
but really want to work on all keyspace that have non-local
replication strategy as they are replicated on other nodes.

Reflect that in the replica::database function name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
d8484b3ee6 storage_service: coroutinize update_pending_ranges
Before we make a change in getting the
keyspaces and their effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
e541009f65 effective_replication_map: add get_replication_strategy
And use it in storage_service::get_changed_ranges_for_leaving.

A following patch will pass the e_r_m to
storage_service::get_changed_ranges_for_leaving, rather than
getting it there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
6794e15163 effective_replication_map: get_range_addresses: use the precalculated replication_map
There is no need to call get_natural_endpoints for every token
in sorted_tokens order, since we can just get the precalculated
per-token endpoints already in the _replication_map member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
1d4aea4441 abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
Reduce large allocations and reactor stalls seen in #11005
by open coding `get_address_ranges` and using std::vector::insert
to efficiently appending the ranges returned by `get_primary_ranges_for`
onto the returned token_range_vector in contrast to building
an unordered_multimap<inet_address, dht::token_range> first in
`get_address_ranges` and traversing it and adding one token_range
at a time.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
7811b0d0aa abstract_replication_strategy: reindent 2022-08-08 17:31:00 +03:00
Benny Halevy
ebe1edc091 utils: sequenced_set: expose set and contains method
And use that in sights using the endpoint set
returned by abstract_replication_strategy::calculate_natural_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
7017ad6822 abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
So it could be used also for easily searching for an endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
38934413d4 utils: sequenced_set: templatize VectorType
Se we can use basic_sequenced_set<T, std::small_vector<T, N>>
as locator::endpoint_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
df380c30b9 utils: sanitize sequenced_set
And templatize its Vector type so it can be used
with a small_vector for inet_address_vector_replica_set.

Mark the methods const/noexcept as needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
57d9275d4a utils: sequenced_set: delete mutable get_vector method
It is dangerous to use since modifying the
sequenced_set vector will make it go out of sync
with the associated unordered_set member, making
the object unusable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Yaron Kaikov
2fe2306efb configure.py: add date-stamp parameter
When starting `Build` job we have a situation when `x86` and `arm`
starting in different dates causing the all process to fail

As suggested by @avikivity , adding a date-stamp parameter and will pass
it through downstream jobs to get one release for each job

Ref: scylladb/scylla-pkg#3008

Closes #11234
2022-08-08 17:28:38 +03:00
Anna Stuchlik
7d4770c116 doc: remove version-speific content for Ubuntu and add the generic page to the toctree 2022-08-08 16:18:20 +02:00
Anna Stuchlik
eb60e5757a doc: rename the file to include Ubuntu 2022-08-08 16:12:02 +02:00
Anna Stuchlik
011e2fad60 doc: remove the version number from the document and add the link to Supported Versions 2022-08-08 16:11:14 +02:00
Anna Stuchlik
83c08ac5fa doc: add a generic page for Ubuntu 2022-08-08 16:04:59 +02:00
Avi Kivity
871127f641 Update tools/java submodule
* tools/java ad6764b506...6995a83cc1 (1):
  > dist/debian: drop upgrading from scylla-tools < 2.0
2022-08-08 16:51:14 +03:00
Anna Stuchlik
260f85643d doc: specify the recommended AWS instance types 2022-08-08 14:35:54 +02:00
Anna Stuchlik
2c69a8f458 doc: replace the tables with a generic description of support for Im4gn and Is4gen instances 2022-08-08 14:17:59 +02:00
Botond Dénes
49c00fa989 Merge 'Define strong uuid-class types for table_id, table_schema_version and query_id' from Benny Halevy
We would like to define more distinct types
that are currently defined as aliases to utils::UUID to identify
resources in the system, like table id and schema
version id.

As with counter_id, the motivation is to restrict
the usage of the distinct types so they can be used
(assigned, compared, etc.) only with objects of
the same type.  Using with a generic UUID will
then require explicit conversion, that we want to
expose.

This series starts with cleaning up the idl header definition
by adding support for `import` and `include` statements in the idl-compiler.
These allow the idl header to become self-sufficient
and then remove manually-added includes from source files.
The latter usually need only the top level idl header
and it, in turn, should include other headers if it depends on them.

Then, a UUID_class template was defined as a shared boiler plate
for the various uuid-class.  First, we convert counter_id to use it,
rather than mimicking utils::UUID on its own.

On top of utils::UUID_class<T>, we define table_id,
table_schema_version, and query_id.

Following up on this series, we should define more commonly used
types like: host_id, streaming_plan_id, paxos_ballot_id.

Fixes #11207

Closes #11220

* github.com:scylladb/scylladb:
  query-request, everywhere: define and use query_id as a strong type
  schema, everywhere: define and use table_schema_version as a strong type
  schema, everywhere: define and use table_id as a strong type
  schema: include schema_fwd.hh in schema.hh
  system_keyspace: get_truncation_record: delete unused lambda capture
  utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
  utils: tagged_uuid: rename to_uuid() to uuid()
  counters: counter_id: use base class create_random_id
  counters: base counter_id on utils::tagged_uuid
  utils: tagged_uuid: mark functions noexcept
  utils: tagged_uuid: bool: reuse uuid::bool operator
  raft: migrate tagged_id definition to utils::tagged_uuid
  utils: uuid: mark functions noexcept
  counters: counter_id delete requirement for triviality
  utils: bit_cast: require TriviallyCopyable To
  repair: delete unused include of utils/bit_cast.hh
  bit_cast: use std::bit_cast
  idl: make idl headers self-sufficient
  db: hints: sync_point: do not include idl definition file
  db/per_partition_rate_limit: tidy up headers self-sufficiency
  idl-compiler: include serialization impl and visitors in generated dist.impl.hh files
  idl-compiler: add include statements
  idl_test: add a struct depending on UUID
2022-08-08 13:20:40 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Benny Halevy
2b017ce285 schema, everywhere: define and use table_schema_version as a strong type
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.

Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:45 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
26aacb328e schema: include schema_fwd.hh in schema.hh
And remove repeated definitions and forward declarations
of the same types in both places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
6e77ad9392 system_keyspace: get_truncation_record: delete unused lambda capture
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
a390b8475b utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
And simplify usage for appending_hash<counter_shard_view>
respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
8235cfdf7a utils: tagged_uuid: rename to_uuid() to uuid()
To make it more generic, similar to other uuid() get
methods we have.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
813cffc2b5 counters: counter_id: use base class create_random_id
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)

This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
e9cc24bc18 counters: base counter_id on utils::tagged_uuid
Use the common base class for uuid-based types.

tagged_uuid::to_uuid defined here for backward
compatibility, but it will be renamed in the next patch
to uuid().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
082d5efca8 utils: tagged_uuid: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1b78f8ba82 utils: tagged_uuid: bool: reuse uuid::bool operator
utils::UUID defined operator bool the same way,
rely on it rather than reimplementing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
6436c614d7 raft: migrate tagged_id definition to utils::tagged_uuid
So it can be used for other types in the system outside
of raft, like counter_id, table_id, table_schema_version,
and more.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
f0567ab853 utils: uuid: mark functions noexcept
Before we define a tagged_uuid template.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
ea91ccaa20 counters: counter_id delete requirement for triviality
This stemmed from utils/bit_cast overly strict requirement.
Now that it was relaxed, these is no need for this static assert
as counter_id is trivially copyable, and that is checked
by bit_cast {read,write}_unaligned

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
c68e929b80 utils: bit_cast: require TriviallyCopyable To
Like std::bit_cast (https://en.cppreference.com/w/cpp/numeric/bit_cast)
we only require To to be trivially copyable.

There's no need for it to be a trivial type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
2948a4feb6 repair: delete unused include of utils/bit_cast.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
79000bc02e bit_cast: use std::bit_cast
Now that scylla requries c++20 there's no
need to define our own implementation in utils/bit_cast.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1fda686f96 idl: make idl headers self-sufficient
Add include statements to satisfy dependencies.

Delete, now unneeded, include directives from the upper level
source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
cfc7e9aa59 db: hints: sync_point: do not include idl definition file
idl definition files are not intended for direct
inclusion in .cc files.

Data types it represents are supposed to be defined
in regular C++ header, so define them in db/hints/scyn_point.hh
and include it rather then idl/hinted_handoff.idl.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
82fa205723 db/per_partition_rate_limit: tidy up headers self-sufficiency
include what's needed where needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
83811b8e35 idl-compiler: include serialization impl and visitors in generated dist.impl.hh files
They are generally required by the serialization implementation.
This will simplify using them without having to hand pick
what header to include in the .cc file that includes them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
da4f0aae37 idl-compiler: add include statements
For generating #include directives in the generated files,
so we don't have to hand-craft include the dependencies
in the right order.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
4f275a17b4 idl_test: add a struct depending on UUID
For testing the next change which adds
import and include statements to the idl language.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Avi Kivity
ba42852b0e Merge 'Overhaul truncate and snapshot' from Benny Halevy
This series is aimed at fixing #11132.

To get there, the series untangles the functions that currently depend on the the cross-shard coordination in table::snapshot,
namely database::truncate and consequently database::drop_column_family.

database::get_table_on_all_shards is added here as a helper to get a foreign shared ptr of the the table shard from all shards,
and it is later used by multiple functions to truncate and then take a snapshot of the sharded table.

database::truncate_table_on_all_shards is defined to orchestrate the truncate process end-to-end, flushing or clearing all table shards before taking a snapshot if needed, using the newly defined table::snapshot_on_all_shards, and by that leaving only the discard_sstables job to the per-shard database::truncate function.

The latter, snapshot_on_all_shards, orchestrates the snapshot process on all shards - getting rid of the per-shard table::snapshot function (after refactoring take_snapshot and finalize_snapshot out of it), and the associated dreaded data structures: snapshot_manager and pending_snapshots.

Fixes #11132.

Closes #11133

* github.com:scylladb/scylladb:
  table: reindent write_schema_as_cql
  table: coroutinize write_schema_as_cql
  table: seal_snapshot: maybe_yield when iterating over the table names
  table: reindent seal_snapshot
  table: coroutinize seal_snapshot
  table: delete unused snapshot_manager and pending_snapshots
  table: delete unused snapshot function
  table: snapshot_on_all_shards: orchestrate snapshot process
  table: snapshot: move pending_snapshots.erase from seal_snapshot
  table: finalize_snapshot: take the file sets as a param
  table: make seal_snapshot a static member
  table: finalize_snapshot: reindent
  table: refactor finalize_snapshot out of snapshot
  table: snapshot: keep per-shard file sets in snapshot_manager
  table: take_snapshot: return foreign unique ptr
  table: take_snapshot: maybe yield in per-sstable loop
  table: take_snapshot: simplify tables construction code
  table: take_snapshot: reindent
  table: take_snapshot: simplify error handling
  table: refactor take_snapshot out of snapshot
  utils: get rid of joinpoint
  database: get rid of timestamp_func
  database: truncate: snapshot table in all-shards layer
  database: truncate: flush table and views in all-shards layer
  database: truncate: stop and disable compaction in all-shards layer
  database: truncate: move call to set_low_replay_position_mark to all-shards layer
  database: truncate: enter per-shard table async_gate in all-shards layer
  database: truncate: move check for schema_tables keyspace to all-shards layer.
  database: snapshot_table_on_all_shards: reindent
  table: add snapshot_on_all_shards
  database: add snapshot_table_on_all_shards
  database: rename {flush,snapshot}_on_all and make static
  database: drop_table_on_all_shards: truncate and stop table in upper layer
  database: drop_table_on_all_shards: get all table shards before drop_column_family on each
  database: drop_column_family: define table& cf
  database: drop_column_family: reuse uuid for evict_all_for_table
  database: drop_column_family: move log message up a layer
  database: truncate: get rid of the unused ks param
  database: add truncate_table_on_all_shards
  database: drop_table_on_all_shards: do not accept a truncated_at timestamp_func
  database: truncate: get optional snapshot_name from caller
  database: truncate: fix assert about replay_position low_mark
  database_test: apply_mutation on the correct db shard
2022-08-07 19:15:42 +03:00
Benny Halevy
45ce635527 table: reindent write_schema_as_cql
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3b2cce068a table: coroutinize write_schema_as_cql
and make sure to always close the output stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
dbae7807d1 table: seal_snapshot: maybe_yield when iterating over the table names
Add maybe_yield calls in tight loop, potentially
over thousands of sstable names to prevent reactor stalls.
Although the per-sstable cost is very small, we've experienced
stalls realted to printing in O(#sstables) in compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3ba0c72b77 table: reindent seal_snapshot
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
41a2d09a5d table: coroutinize seal_snapshot
Handle exceptions, making sure the output
stream is properly closed in all cases,
and an intermediate error, if any, is returned as the
final future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5316dbbe78 table: delete unused snapshot_manager and pending_snapshots
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
cca9068cfb table: delete unused snapshot function
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
351a3a313d table: snapshot_on_all_shards: orchestrate snapshot process
Call take_snapshot on each shard and collect the
returns snapshot_file_set.

When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.

All that without resorting to using the snapshot_manager
nor calling table::snapshot.

Both will deleted in the following patches.

Fixes #11132

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
84dfd2cabb table: snapshot: move pending_snapshots.erase from seal_snapshot
Now that seal_snapshot doesn't need to lookup
the snapshot_manager in pending_snapshots to
get to the file_sets, erasing the snapshot_manager
object can be done in table::snapshot which
also inserted it there.

This will make it easier to get rid of it in a later patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
39276cacc3 table: finalize_snapshot: take the file sets as a param
and pass it to seal_snapshot, so that the latter won't
need to lookup and access the snapshot_manager object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4dd56bbd6d table: make seal_snapshot a static member
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
7cb0a3f6f4 table: finalize_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
12716866a9 table: refactor finalize_snapshot out of snapshot
Write schema.cql and the files manifest in finalize_snapshot.
Currently call it from table::snapshot, but it will
be called in a later patch by snapshot_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
240f83546d table: snapshot: keep per-shard file sets in snapshot_manager
To simplify processing of the per-shard file names
for generating the manifest.

We only need to print them to the manifest at the
end of the process, so there's no point in copying
them around in the process, just move the
foreign unique unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5100c1ba68 table: take_snapshot: return foreign unique ptr
Currently copying the sstable file names are created
and destroyed on each shard and are copied by the
"coordinator" shards using submit_to, while the
coroutine holds the source on its stack frame.

To prepare for the next patches that refactor this
code so that the coordinator shard will submit_to
each shard to perform `take_snapshot` and return
the set of sstrings in the future result, we need
to wrap the result in a foreign_ptr so it gets
freed on the shard that created it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
b54626ad0e table: take_snapshot: maybe yield in per-sstable loop
There could be thousands of sstables so we better
cosider yielding in the tight loop that copies
the sstable names into the unordered_set we return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
24a1a4069e table: take_snapshot: simplify tables construction code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
75e38ebccc table: take_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
67c1d00f44 table: take_snapshot: simplify error handling
Don't catch exception but rather just return
them in the return future, as the exception
is handled by the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ff6508aa53 table: refactor take_snapshot out of snapshot
Do the actual snapshot-taking code in a per-shard
take_snapshot function, to be called from
snapshot_on_all_shards in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
37b7a9cce2 utils: get rid of joinpoint
Now that it is no longer used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
56f336d1aa database: get rid of timestamp_func
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returns the same time_point on all shards
anyhow, and was only used for coordination across shards.

Since now we synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer need
for this timestamp_func.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
b640c4fd17 database: truncate: snapshot table in all-shards layer
With that the database layer does no longer
need to invoke the private table::snapshot function,
so it can be defriended from class table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
af0c71aa12 database: truncate: flush table and views in all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
6e07e6b7ac database: truncate: stop and disable compaction in all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
e78dad1dfb database: truncate: move call to set_low_replay_position_mark to all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
a8bd3d97b6 database: truncate: enter per-shard table async_gate in all-shards layer
Start moving the per-shard state establishment logic
to truncate_table_on_all_shards, so that we would evetually
do only the truncate logic per-se in the per-shard truncate function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ff028316f2 database: truncate: move check for schema_tables keyspace to all-shards layer.
Now that the per-shard truncate function is called
only from truncate_table_on_all_shards, we can reject the schema_tables
keyspace in the upper layer.  There's no need to check that on each shard.

While at it, reuse `is_system_keyspace`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
fbe1fa1370 database: snapshot_table_on_all_shards: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4d4ca40c38 table: add snapshot_on_all_shards
Called from the respective database entry points.

Will be called also from the database drop / truncate path
and will be used for central coordination of per-shard
table::snapshot so we don't have to depend on the snapshot_manager
mechanism that is fragile and currently causes abort if we fail
to allocate it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
be56a73e78 database: add snapshot_table_on_all_shards
We need to snapshot a single table in several paths.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
d96b56fee2 database: rename {flush,snapshot}_on_all and make static
Follow the convention of drop_table_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
a1eed1a6e9 database: drop_table_on_all_shards: truncate and stop table in upper layer
truncate the table on all shards then stop it on shards
in the upper layer rather than in the per-shard drop_column_family()
function, so we can further refactor truncate later, flushing
and taking snapshot on all shards, before truncating.

With that, rename drop_column_family to detach_columng_family
as now it only deregisters the column family from containers
that refer to it (even via its uuid) and then its caller
is reponsible to take it from there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
92cb7d448b database: drop_table_on_all_shards: get all table shards before drop_column_family on each
Se we the upper layer can flush, snapshot, and truncate
the table on all shards, step by step.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
0aaaefbb5c database: drop_column_family: define table& cf
To reduce the churn in the following patch
that will pass the table& as a parameter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
bb1e5ffb8c database: drop_column_family: reuse uuid for evict_all_for_table
cf->schema()->id() is the same one returned
by find_uuid(ks_name, cf_name);

As a follow up, we should define a concrete
table_id type and rename schema::id() to schema::table_id()
to return it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
e800e1e720 database: drop_column_family: move log message up a layer
Print once on "coordinator" shard.

And promote to info level as it's important to log
when we're dropping a table (and if we're going to take a snapshot).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ca78a63873 database: truncate: get rid of the unused ks param
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
46e2a7c83b database: add truncate_table_on_all_shards
As a first step to decouple truncate from flush
and snpashot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5e8c05f1a8 database: drop_table_on_all_shards: do not accept a truncated_at
timestamp_func

Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.

Fixes #11232

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:52:51 +03:00
Benny Halevy
574909c78f database: truncate: get optional snapshot_name from caller
Before we change drop_table_on_all_shards to always
pass db_clock::time_point::max() in the next patch,
let it pass a unique snapshot name, otherwise
the snapshot name will always be based on the constant, max
time_point.

Refs #11232

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:03:19 +03:00
Benny Halevy
474b2fdf37 database: truncate: fix assert about replay_position low_mark
This assert was tweaked several times:
Introduced in 83323e155e,
then fixed in b2b1a1f7e1 to account
for no rp from discard_sstables, then in
9620755c7f to account for
cases we do not flush the table, then again in
71c5dc82df to make that more accurate.

But, the assert wasn't correct in the first place
in the sense that we first get `low_mark` which
represents the highest replay_position at the time truncate
was called, but then we call discard_sstables with a time_point
of `truncated_at` that we get from the caller via the timestamp_func,
and that one could be in the past, before truncate was called -
hence discard_sstables with that timestamp may very well
return a replay_position from older sstables, prior to flush
that can be smaller than the low_mark.

Fix this assert to account for that case.

The real fix to this issue is to have a truncate_tombstone
that will carry an authoritative api::timstamp (#11230)

Fixes #11231

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 09:18:06 +03:00
Benny Halevy
9f5e13800d database_test: apply_mutation on the correct db shard
Following up on 1c26d49fba,
apply mutations on the correct db shard in all test cases
before we define and use database::truncate_table_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 09:18:06 +03:00
Tomasz Grabiec
7f80602b01 db: range_tombstone_list: Avoid quadratic behavior when applying
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).

This can cause reactor stalls and availability issues when replicas
apply such deletions.

This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.

Fixes #11211

Closes #11215
2022-08-05 20:34:07 +03:00
Kamil Braun
d84a93d683 Merge 'Raft test topology part 1' from Alecco
These are the first commits out of #10815.

It starts by moving pytest logic out of the common `test/conftest.py`
and into `test/topology/conftest.py`, including removing the async
support as it's not used anywhere else.

There's a fix of a bug of leaving tables in `RandomTables.tables` after
dropping all of them.

Keyspace creation is moved out of `conftest.py` into `RandomTables` as
it makes more sense and this way topology tests avoid all the
workarounds for old version (topology needs ScyllaDB 5+ for Raft,
anyway).

And a minor fix.

Closes #11210

* github.com:scylladb/scylladb:
  test.py: fix type hint for seed in ScyllaServer
  test.py: create/drop keyspace in tables helper
  test.py: RandomTables clear list when dropping all tables
  test.py: move topology conftest logic to its own
  test.py: async topology tests auto run with pytest_asyncio
2022-08-05 17:56:16 +02:00
Anna Stuchlik
d48ae5a9e0 doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1 2022-08-05 17:49:01 +02:00
Warren Krewenki
4178ccd27f gossiper: Correct typo in log message
Closes #11212
2022-08-05 18:21:36 +03:00
Anna Stuchlik
ceaf0c41bd doc: add support for AWS i4g instances 2022-08-05 17:18:44 +02:00
Anna Stuchlik
7711436577 doc: extend the list of supported CPUs 2022-08-05 16:55:40 +02:00
Alejo Sanchez
ec70e26f12 test.py: fix type hint for seed in ScyllaServer
Param seed can be None (e.g. first server) so fix type hint accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
1d7789e5a9 test.py: create/drop keyspace in tables helper
Since all topology test will use the helper, create the keyspace in the
helper.

Avoid the need of dropping all tables per test and just drop the
keyspace.

While there, use blocking CQL execution so it can be used in the
constructor and avoids possible issues with scheduling on cleanup. Also,
creation and drop should happen only once per cluster and no test should
be running changes (either not started or finished).

All topology tests are for Scylla with Raft. So don't use the Cassandra
this_dc workaround as it's unnecessary for Scylla.

Remove return type of random_tables fixture to match other fixtures
everywhere else.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
9a019628f5 test.py: RandomTables clear list when dropping all tables
Clear the list of active tables when dropping them.

While there do the list element exchange atomically across active and
removed tables lists.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
f6aa0d7bd7 test.py: move topology conftest logic to its own
Move asyncio, Raft checks, and RandomTables to topology test suite's own
conftest file.

While there, use non-async version of pre-checks to avoid unnecessary
complexity (we want async tests, not async setup, for now).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
f665779cdb test.py: async topology tests auto run with pytest_asyncio
Async tests and fixtures in the topology directory are expected to run
with pytest_asyncio (not other async frameworks). Force this with auto
mode.

CI has an older pytest_asyncio version lacking pytest_asyncio.fixture.
Auto mode helps avoiding the need of it and tests and fixtures can just
be marked with regular @pytest.mark.async.

This way tests can run in both older and newer versions of the packages.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Botond Dénes
fbbe2529c1 Merge "Remove global snitch usage from consistency_level.cc" from Pavel Emelyanov
"
There are several helpers in this .cc file that need to get datacenter
for endpoints. For it they use global snitch, because there's no other
place out there to get that data from.

The whole dc/rack info is now moving to topology, so this set patches
the consistency_level.cc to get the topology. This is done two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.

Two difficult cases are db::is_local() and db.count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"

* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
  consistency_level: Remove is_local() and count_local_endpoints()
  storage_proxy: Use topology::local_endpoints_count()
  storage_proxy: Use proxy's topology for DC checks
  storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
  storage_proxy: Use topology local_dc_filter in its methods
  storage_proxy: Mark some digest_read_resolver methods private
  forwarding_service: Use topology local_dc_filter
  storage_service: Use topology local_dc_filter
  consistency_level: Use topology local_dc_filter
  consitency-level: Call count_local_endpoints from topology
  consistency_level: Get datacenter from topology
  replication_strategy: Remove hold snitch reference
  effective_replication_map: Get datacenter from topology
  topology: Add local-dc detection shugar
2022-08-05 13:31:55 +03:00
Anna Stuchlik
4bc7833a0b doc: update the link to CQL3 type mapping on GitHub
Closes #11224
2022-08-05 13:21:29 +03:00
Pavel Emelyanov
c3718b7a6e consistency_level: Remove is_local() and count_local_endpoints()
No code uses them now -- switched to use topology -- so thse two can be
dropped together with their calls for global snitch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
9c662ee0e5 storage_proxy: Use topology::local_endpoints_count()
A continuation of the previous patches -- now all the code that needs
this helper have proxy pointer at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
9a50d318b6 storage_proxy: Use proxy's topology for DC checks
Several proxy helper classes need to filter endpoints by datacenter.
Since now the have shared_ptr<proxy> on-board, they can get topology
via proxy's token metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
183a2d5a83 storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
It will be needed to get token metadata from proxy. The resolver in
question is created and maintained by abstract_read_executor which
already has shared_ptr<proxy>, so it just gives its copy

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
e1ea801b67 storage_proxy: Use topology local_dc_filter in its methods
The proxy has token metadata pointer, so it can use its topology
reference to filter endpoints by datacenter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
6f515f852d storage_proxy: Mark some digest_read_resolver methods private
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
9a19414c62 forwarding_service: Use topology local_dc_filter
The service needs to filter out non-local endpoints for its needs. The
service carries token metadata pointer and can get topology from it to
fulfill this goal

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
2423e1c642 storage_service: Use topology local_dc_filter
The storage-service API calls use db::is_local() helper to filter out
tokens from non-local datacenter. In all those places topology is
available from the token metadata pointer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
0da8caba1d consistency_level: Use topology local_dc_filter
The filter_for_query() helper has keyspace at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
de58b33eee consitency-level: Call count_local_endpoints from topology
Similar to previous patch, in those places with keyspace object at
hand the topology can be obtained from ks' replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
f84ee8f0fb consistency_level: Get datacenter from topology
In some of db/consistency_level.cc helpers the topology can be
obtained from keyspace's effective replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
00f166809e replication_strategy: Remove hold snitch reference
When the strategy is constructed there's no place to get snitch from
so the global instance is used. However, after previous patch the
replication strategy no longer needs snitch, so this dependency can
be dropped

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:43 +03:00
Pavel Emelyanov
298213f27f effective_replication_map: Get datacenter from topology
Now it gets it from snitch, but the dc/rack info is being relocated
onto topology. The topology is in turn already there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:31 +03:00
Calle Wilund
fac2bc41ba commitlog: Include "segments_to_replay" in initial footprint
Fixes #11184

Not including it here can cause our estimate of "delete or not" after replay
to be skewed in favour of retaining segments as (new) recycles (or even flip
a counter), and if we have repeated crash+restarts we could be accumulating
an effectivly ever increasing segment footprint

Closes #11205
2022-08-05 12:16:53 +03:00
Pavel Emelyanov
527b345079 Merge 'storage_proxy: introduce a remote "subservice"' from Kamil Braun
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.

The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.

The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.

This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.

Closes #11088

* github.com:scylladb/scylladb:
  service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
  service: storage_proxy: `remote`: make `_gossiper` a const reference
  gms: gossiper: mark some member functions const
  db: consistency_level: `filter_for_query`: take `const gossiper&`
  replica: table: `get_hit_rate`: take `const gossiper&`
  gms: gossiper: move `endpoint_filter` to `storage_proxy` module
  service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
  service: storage_proxy: establish private section in `remote`
  service: storage_proxy: remove `migration_manager` pointer
  service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
  service: storage_proxy: remove `_gossiper` field
  alternator: ttl: pass `gossiper&` to `expiration_service`
  service: storage_proxy: move `truncate_blocking` implementation to `remote`
  service: storage_proxy: introduce `is_alive` helper
  service: storage_proxy: remove `_messaging` reference
  service: storage_proxy: move `connection_dropped` to `remote`
  service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
  service: storage_proxy: move `handle_write` to `remote`
  service: storage_proxy: move `handle_paxos_prune` to `remote`
  service: storage_proxy: move `handle_paxos_accept` to `remote`
  service: storage_proxy: move `handle_paxos_prepare` to `remote`
  service: storage_proxy: move `handle_truncate` to `remote`
  service: storage_proxy: move `handle_read_digest` to `remote`
  service: storage_proxy: move `handle_read_mutation_data` to `remote`
  service: storage_proxy: move `handle_read_data` to `remote`
  service: storage_proxy: move `handle_mutation_failed` to `remote`
  service: storage_proxy: move `handle_mutation_done` to `remote`
  service: storage_proxy: move `handle_paxos_learn` to `remote`
  service: storage_proxy: move `receive_mutation_handler` to `remote`
  service: storage_proxy: move `handle_counter_mutation` to `remote`
  service: storage_proxy: remove `get_local_shared_storage_proxy`
  service: storage_proxy: (de)register RPC handlers in `remote`
  service: storage_proxy: introduce `remote`
2022-08-04 17:50:20 +03:00
Alejo Sanchez
97f0e11c3a test.py: handle properly pytest ouput file for CQL tests
Previously, if pytest itself failed (e.g. bad import or unexpected
parameter), there was no output file but test.py tried to copy it and
failed.

Change the logic of handling the output file to first check if the
file is there. Then if it's worth keeping it, *move* it to the test
directory for easier comparison and maintenance. Else, if it's not worth
keeping, discard it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11193
2022-08-04 16:48:53 +02:00
Pavel Emelyanov
cf0f912e59 cdc: Handle sleep-aborted exception on stop
When update_streams_description() fails it spawns a fiber and retries
the update in the background once every 60s. If the sleeping between
attempts is aborted, the respective exceptional future happens to be
ignored and warned in logs.

fixes: #11192

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220802132148.20688-1-xemul@scylladb.com>
2022-08-04 13:03:29 +02:00
Kamil Braun
0a4e701b50 service: storage_proxy: pass migration_manager* to init_messaging_service
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.

Later, when `init_messaging_service` will actually construct `remote`,
this will be a reference, not a pointer.

Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
2022-08-04 12:19:43 +02:00
Kamil Braun
a08be82ce2 service: storage_proxy: remote: make _gossiper a const reference 2022-08-04 12:19:43 +02:00
Kamil Braun
a1aa9cf3f7 gms: gossiper: mark some member functions const 2022-08-04 12:19:43 +02:00
Kamil Braun
a9fd156a1b db: consistency_level: filter_for_query: take const gossiper& 2022-08-04 12:19:38 +02:00
Kamil Braun
7b4146dd2a replica: table: get_hit_rate: take const gossiper&
It doesn't use any non-const members.
2022-08-04 12:16:09 +02:00
Kamil Braun
566e5f2a4f gms: gossiper: move endpoint_filter to storage_proxy module
The function only uses one public function of `gossiper` (`is_alive`)
and is used only in one place in `storage_proxy`.

Make it a static function private to the `storage_proxy` module.

The function used a `default_random_engine` field in `gossiper` for
generating random numbers. Turn this field into a static `thread_local`
variable inside the function - no other `gossiper` members used the
field.
2022-08-04 12:16:09 +02:00
Kamil Braun
078900042f service: storage_proxy: pass shared_ptr<gossiper> to start_hints_manager
No need to call `_remote->gossiper().shared_from_this()` from within
storage_proxy now.
2022-08-04 12:16:09 +02:00
Kamil Braun
d9d10d87ec service: storage_proxy: establish private section in remote
Only the (un)init, send_*, and `is_alive` functions are public,
plus a getter for gossiper.
2022-08-04 12:16:05 +02:00
Kamil Braun
7364d453dd service: storage_proxy: remove migration_manager pointer
The ownership is passed to `remote`, which now contains
a `shared_ptr<migration_manager>`.
2022-08-04 12:15:36 +02:00
Kamil Braun
bcc22ed1dc service: storage_proxy: remove calls to storage_proxy::remote() from remote
Catch `this` in the lambdas.
2022-08-04 12:15:36 +02:00
Kamil Braun
eddd3b8226 service: storage_proxy: remove _gossiper field
Access `gossiper` through `_remote`.
Later, all those accesses will handle missing `remote`.

Note that there are also accesses through the `remote()` internal getter.

The plan is as follows:
- direct accesses through `_remote` will be modified to handle missing
  `_remote` (these won't cause an error)
- `remote()` will throw if `_remote` is missing (`remote()` is only used
  for operations which actually need to send a message to a remote node).
2022-08-04 12:15:35 +02:00
Kamil Braun
ab946e392f alternator: ttl: pass gossiper& to expiration_service
This allows us to remove the `gossiper()` getter from `storage_proxy`.
2022-08-04 12:12:43 +02:00
Kamil Braun
242e31d56e service: storage_proxy: move truncate_blocking implementation to remote
The truncate operation always truncates a table on the entire cluster,
even for local tables. And it always does it by sending RPCs (the node
sends an RPC to itself too). Thus it fits in the remote class.

If we want to add a possibility to "truncate locally only" and/or change
the behavior for local tables, we can add a branch in
`storage_proxy::truncate_blocking`.

Refs: #11087
2022-08-04 12:12:43 +02:00
Kamil Braun
3e73de9a40 service: storage_proxy: introduce is_alive helper
A helper is introduced both in `remote` and in `storage_proxy`.

The `storage_proxy` one calls the `remote` one. In the future it will
also handle a missing `remote`. Then it will report only the local node
to be alive and other nodes dead while `remote` is missing.

The change reduces the number of functions using the `_gossiper` field
in `storage_proxy`.
2022-08-04 12:12:41 +02:00
Jenkins Promoter
0ce19e7812 release: prepare for 5.2.0-dev 2022-08-04 13:09:55 +03:00
Botond Dénes
df203a48af Merge "Remove reconnectable_snitch_helper" from Pavel Emelyanov
"
The helper is in charge of receiving INTERNAL_IP app state from
gossiper join/change notifications, updating system.peers with it
and kicking messaging service to update its preferred ip cache
along with initiating clients reconnection.

Effectively this helper duplicates the topology tracking code in
storage-service notifiers. Removing it makes less code and drops
a bunch of unwanted cross-components dependencies, in particular:

- one qctx call is gone
- snitch (almost) no longer needs to get messaging from gossiper
- public:private IP cache becomes local to messaging and can be
  moved to topology at low cost

Some nice minor side effect -- this helper was left unsubscribed
from gossiper on stop and snitch rename. Now its all gone.
"

* 'br-remove-reconnectible-snitch-helper-2' of https://github.com/xemul/scylla:
  snitch: Remove reconnectable snitch helper
  snitch, storage_service: Move reconnect to internal_ip kick
  snitch, storage_service: Move system.peers preferred_ip update
  snitch: Export prefer-local
2022-08-04 13:06:05 +03:00
Anna Stuchlik
532aa6e655 doc: update the links to Manager and Operator
Closes #11196
2022-08-04 11:38:39 +03:00
Anna Stuchlik
143455d7ac doc: rewording 2022-08-03 16:58:29 +02:00
Anna Stuchlik
f2af63ddd5 doc: update the links to fix the warnings 2022-08-03 15:12:41 +02:00
Anna Stuchlik
1d61550c64 doc: add the new page to the toctree 2022-08-03 15:03:48 +02:00
Anna Stuchlik
756b9a278f doc: add the descrption of specifying workload attributes with service levels 2022-08-03 14:57:50 +02:00
Anna Stuchlik
2fa175a819 doc: add the definition of workloads to the glossary 2022-08-03 13:31:07 +02:00
Piotr Dulikowski
4f2adc14de db/system_keyspace: fix indentation after previous patch 2022-08-03 13:19:19 +02:00
Piotr Dulikowski
eff8a6368c db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:

- The `broadcast_rpc_address` is the address that the drivers are
  supposed to connect to. `rpc_address` is the address that the node
  binds to - it can be set for example to 0.0.0.0 so that Scylla listens
  on all addresses, however this gives no useful information to the
  driver.
- The `system.peers` table also has the `rpc_address` column and it
  already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.

Fixes: #11201
2022-08-03 13:19:03 +02:00
Kamil Braun
2aff2fea00 service: storage_proxy: remove _messaging reference
All uses of `messaging_service&` have been moved to `remote`.
2022-08-02 19:55:12 +02:00
Kamil Braun
cf931c7863 service: storage_proxy: move connection_dropped to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
2203d4fa09 service: storage_proxy: make encode_replica_exception_for_rpc a static function
No need for this ugly template to be part of the `storage_proxy` header.
2022-08-02 19:55:12 +02:00
Kamil Braun
3499bc7731 service: storage_proxy: move handle_write to remote
It is a helper used by `receive_mutation_handler`
and `handle_paxos_learn`.
2022-08-02 19:55:12 +02:00
Kamil Braun
ba88ad8db0 service: storage_proxy: move handle_paxos_prune to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
548767f91e service: storage_proxy: move handle_paxos_accept to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
807c7f32de service: storage_proxy: move handle_paxos_prepare to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
0e431e7c03 service: storage_proxy: move handle_truncate to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f8c1ba357f service: storage_proxy: move handle_read_digest to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
43997af40f service: storage_proxy: move handle_read_mutation_data to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
80586a0c7e service: storage_proxy: move handle_read_data to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
00c0ee44bd service: storage_proxy: move handle_mutation_failed to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
b9c436c6e0 service: storage_proxy: move handle_mutation_done to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
178536d5d2 service: storage_proxy: move handle_paxos_learn to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f309886fac service: storage_proxy: move receive_mutation_handler to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
fad14d2094 service: storage_proxy: move handle_counter_mutation to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
93325a220f service: storage_proxy: remove get_local_shared_storage_proxy
Its remaining uses are trivial to remove.

Note: in `handle_counter_mutation` we had this piece of code:
```
            }).then([trace_state_ptr = std::move(trace_state_ptr), &mutations, cl, timeout] {
                auto sp = get_local_shared_storage_proxy();
                return sp->mutate_counters_on_leader(...);
```

Obtaining a `shared_ptr` to `storage_proxy` at this point is
no different from obtaining a regular pointer:
- The pointer is obtained inside `then` lambda body, not in the capture
  list. So if the goal of obtaining a `shared_ptr` here was to keep
  `storage_proxy` alive until the `then` lambda body is executed, that
  goal wasn't achieved because the pointer was obtained too late.
- The `shared_ptr` is destroyed as soon as `mutate_counters_on_leader`
  returns, it's not stored anywhere. So it doesn't prolong the lifetime
  of the service.

I replaced this with a simple capture of `this` in the lambda.
2022-08-02 19:55:12 +02:00
Kamil Braun
5148eafbd6 service: storage_proxy: (de)register RPC handlers in remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f174645ab5 service: storage_proxy: introduce remote
Move most accesses to `_messaging` to this struct
(functions that send RPCs).
2022-08-02 19:55:10 +02:00
Pavel Emelyanov
ee0828b506 topology: Add local-dc detection shugar
It's often needed to check if an endpoint sits in the same DC as the
current node. It can be done by

    topo.get_datacenter() == topo.get_datacenter(endpoint)

but in some cases a RAII filter function can be helpful.

Also there's a db::count_local_endpoints() that is surprisingly in use,
so add it to topology as well. Next patches will make use of both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-30 17:58:45 +03:00
Anna Stuchlik
844c875f15 doc: add info about the time-consuming step due to resharding 2022-07-26 14:52:11 +02:00
Pavel Emelyanov
40d6ea973c snitch: Remove reconnectable snitch helper
It's now no-op

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:51:05 +03:00
Pavel Emelyanov
b91f7e9ec4 snitch, storage_service: Move reconnect to internal_ip kick
The same thing as in previous patch -- when gossiper issues
on_join/_change notification, storage service can kick messaging
service to update its internal_ip cache and reconnect to the peer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:46 +03:00
Pavel Emelyanov
1bf8b0dd92 snitch, storage_service: Move system.peers preferred_ip update
Currently the INTERNAL_IP state is updated using reconnectable helper
by subscribing on on_join/on_change events from gossiper. The same
subscription exists in storage service (it's a bit more elaborated by
checking if the node is the part of the ring which is OK).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:46 +03:00
Pavel Emelyanov
0abd2c1e52 snitch: Export prefer-local
The boolean bit says whether "the system" should prefer connecting to
the address gossiper around via INTERNAL_IP. Currently only gossiping
property file snitch allows to tune it and ec2-multiregion snitch
prefers internal IP unconditionally.

So exporting consists of 2 pieces:

- add prefer_local() snitch method that's false by default or returns
  the (existing) _prefer_local bit for production snitch base
- set the _prefer_local to true by ec2-multiregion snitch

While at it the _prefer_local is moved to production_snitch_base for
uniformity with the new prefer_local() call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:04 +03:00
Anna Stuchlik
ff5c4a33f5 doc: add the new KB to the toctree 2022-07-25 14:29:33 +02:00
Anna Stuchlik
f1daef4b1b doc: doc: add a KB about updating the mode in perftune.yaml after upgrade 2022-07-25 14:22:02 +02:00
1190 changed files with 67749 additions and 27733 deletions

24
.github/CODEOWNERS vendored
View File

@@ -12,7 +12,7 @@ test/cql/cdc_* @kbr- @elcallio @piodul @jul-stas
test/boost/cdc_* @kbr- @elcallio @piodul @jul-stas
# COMMITLOG / BATCHLOG
db/commitlog/* @elcallio
db/commitlog/* @elcallio @eliransin
db/batch* @elcallio
# COORDINATOR
@@ -25,7 +25,7 @@ compaction/* @raphaelsc @nyh
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @psarna @cvybhu
cql3/* @tgrabiec @cvybhu @nyh
# COUNTERS
counters* @jul-stas
@@ -33,7 +33,7 @@ tests/counter_test* @jul-stas
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh @psarna
docs/alternator @annastuchlik @tzach @nyh @havaker @nuivall
# GOSSIP
gms/* @tgrabiec @asias
@@ -45,9 +45,9 @@ dist/docker/*
utils/logalloc* @tgrabiec
# MATERIALIZED VIEWS
db/view/* @nyh @psarna
cql3/statements/*view* @nyh @psarna
test/boost/view_* @nyh @psarna
db/view/* @nyh @cvybhu @piodul
cql3/statements/*view* @nyh @cvybhu @piodul
test/boost/view_* @nyh @cvybhu @piodul
# PACKAGING
dist/* @syuu1228
@@ -62,9 +62,9 @@ service/migration* @tgrabiec @nyh
schema* @tgrabiec @nyh
# SECONDARY INDEXES
db/index/* @nyh @psarna
cql3/statements/*index* @nyh @psarna
test/boost/*index* @nyh @psarna
index/* @nyh @cvybhu @piodul
cql3/statements/*index* @nyh @cvybhu @piodul
test/boost/*index* @nyh @cvybhu @piodul
# SSTABLES
sstables/* @tgrabiec @raphaelsc @nyh
@@ -74,11 +74,11 @@ streaming/* @tgrabiec @asias
service/storage_service.* @tgrabiec @asias
# ALTERNATOR
alternator/* @nyh @psarna
test/alternator/* @nyh @psarna
alternator/* @nyh @havaker @nuivall
test/alternator/* @nyh @havaker @nuivall
# HINTED HANDOFF
db/hints/* @piodul @vladzcloudius
db/hints/* @piodul @vladzcloudius @eliransin
# REDIS
redis/* @nyh @syuu1228

View File

@@ -0,0 +1,17 @@
name: "Docs / Amplify enhanced"
on: issue_comment
jobs:
build:
runs-on: ubuntu-latest
if: ${{ github.event.issue.pull_request }}
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Amplify enhanced
env:
TOKEN: ${{ secrets.GITHUB_TOKEN }}
uses: scylladb/sphinx-scylladb-theme/.github/actions/amplify-enhanced@master

2
.gitignore vendored
View File

@@ -31,3 +31,5 @@ docs/poetry.lock
compile_commands.json
.ccls-cache/
.mypy_cache
.envrc
rust/Cargo.lock

8
.gitmodules vendored
View File

@@ -1,17 +1,11 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = tools/jmx
url = ../scylla-jmx

View File

@@ -42,22 +42,13 @@ set(Seastar_CXX_FLAGS ${cxx_coro_flag} ${target_arch_flag} CACHE INTERNAL "" FOR
set(Seastar_CXX_DIALECT gnu++20 CACHE INTERNAL "" FORCE)
add_subdirectory(seastar)
add_subdirectory(abseil)
# Exclude absl::strerror from the default "all" target since it's not
# used in Scylla build and, moreover, makes use of deprecated glibc APIs,
# such as sys_nerr, which are not exposed from "stdio.h" since glibc 2.32,
# which happens to be the case for recent Fedora distribution versions.
#
# Need to use the internal "absl_strerror" target name instead of namespaced
# variant because `set_target_properties` does not understand the latter form,
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# System libraries dependencies
find_package(Boost COMPONENTS filesystem program_options system thread regex REQUIRED)
find_package(Lua REQUIRED)
find_package(ZLIB REQUIRED)
find_package(ICU COMPONENTS uc REQUIRED)
find_package(Abseil REQUIRED)
set(scylla_build_dir "${CMAKE_BINARY_DIR}/build/${BUILD_TYPE}")
set(scylla_gen_build_dir "${scylla_build_dir}/gen")
@@ -189,6 +180,8 @@ set(swagger_files
api/api-doc/storage_service.json
api/api-doc/stream_manager.json
api/api-doc/system.json
api/api-doc/task_manager.json
api/api-doc/task_manager_test.json
api/api-doc/utils.json)
set(swagger_gen_files)
@@ -301,6 +294,8 @@ set(scylla_sources
api/storage_service.cc
api/stream_manager.cc
api/system.cc
api/task_manager.cc
api/task_manager_test.cc
atomic_cell.cc
auth/allow_all_authenticator.cc
auth/allow_all_authorizer.cc
@@ -424,6 +419,8 @@ set(scylla_sources
cql3/statements/sl_prop_defs.cc
cql3/statements/truncate_statement.cc
cql3/statements/update_statement.cc
cql3/statements/strongly_consistent_modification_statement.cc
cql3/statements/strongly_consistent_select_statement.cc
cql3/statements/use_statement.cc
cql3/type_json.cc
cql3/untyped_result_set.cc
@@ -533,6 +530,7 @@ set(scylla_sources
raft/raft.cc
raft/server.cc
raft/tracker.cc
service/broadcast_tables/experimental/lang.cc
range_tombstone.cc
range_tombstone_list.cc
tombstone_gc_options.cc
@@ -612,6 +610,7 @@ set(scylla_sources
streaming/stream_task.cc
streaming/stream_transfer_task.cc
table_helper.cc
tasks/task_manager.cc
thrift/controller.cc
thrift/handler.cc
thrift/server.cc
@@ -738,7 +737,6 @@ target_compile_definitions(scylla PRIVATE XXH_PRIVATE_API HAVE_LZ4_COMPRESS_DEFA
target_include_directories(scylla PRIVATE
"${CMAKE_CURRENT_SOURCE_DIR}"
libdeflate
abseil
"${scylla_gen_build_dir}")
###

View File

@@ -1,11 +1,12 @@
#!/bin/sh
USAGE=$(cat <<-END
Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] -- generate Scylla version and build information files.
Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] [--date-stamp DATE] -- generate Scylla version and build information files.
Options:
-h|--help show this help message.
-o|--output-dir PATH specify destination path at which the version files are to be created.
-d|--date-stamp DATE manually set date for release parameter
By default, the script will attempt to parse 'version' file
in the current directory, which should contain a string of
@@ -31,7 +32,9 @@ using '-o PATH' option.
END
)
while [[ $# -gt 0 ]]; do
DATE=""
while [ $# -gt 0 ]; do
opt="$1"
case $opt in
-h|--help)
@@ -43,6 +46,11 @@ while [[ $# -gt 0 ]]; do
shift
shift
;;
--date-stamp)
DATE="$2"
shift
shift
;;
*)
echo "Unexpected argument found: $1"
echo
@@ -58,24 +66,33 @@ if [ -z "$OUTPUT_DIR" ]; then
OUTPUT_DIR="$SCRIPT_DIR/build"
fi
if [ -z "$DATE" ]; then
DATE=$(date --utc +%Y%m%d)
fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.1.0-dev
VERSION=5.2.19
if test -f version
then
SCYLLA_VERSION=$(cat version | awk -F'-' '{print $1}')
SCYLLA_RELEASE=$(cat version | awk -F'-' '{print $2}')
else
DATE=$(date --utc +%Y%m%d)
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1 --abbrev=12)
SCYLLA_VERSION=$VERSION
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
if [ -z "$SCYLLA_RELEASE" ]; then
DATE=$(date --utc +%Y%m%d)
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1 --abbrev=12)
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
elif [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
echo "setting SCYLLA_RELEASE only makes sense in clean builds" 1>&2
exit 1
fi
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then

1
abseil

Submodule abseil deleted from 9e408e050f

View File

@@ -133,14 +133,15 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::strin
}
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
service::client_state client_state{service::client_state::internal_tag()};
service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), empty_service_permit(), client_state));
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();

View File

@@ -14,6 +14,8 @@
#include "db/config.hh"
#include "cdc/generation_service.hh"
#include "service/memory_limiter.hh"
#include "auth/service.hh"
#include "service/qos/service_level_controller.hh"
using namespace seastar;
@@ -28,6 +30,8 @@ controller::controller(
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
const db::config& config)
: _gossiper(gossiper)
, _proxy(proxy)
@@ -35,6 +39,8 @@ controller::controller(
, _sys_dist_ks(sys_dist_ks)
, _cdc_gen_svc(cdc_gen_svc)
, _memory_limiter(memory_limiter)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _config(config)
{
}
@@ -77,7 +83,7 @@ future<> controller::start_server() {
auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion

View File

@@ -34,6 +34,14 @@ class gossiper;
}
namespace auth {
class service;
}
namespace qos {
class service_level_controller;
}
namespace alternator {
// This is the official DynamoDB API version.
@@ -53,6 +61,8 @@ class controller : public protocol_server {
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
sharded<service::memory_limiter>& _memory_limiter;
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -68,6 +78,8 @@ public:
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
const db::config& config);
virtual sstring name() const override;

View File

@@ -23,7 +23,7 @@ namespace alternator {
// api_error into a JSON object, and that is returned to the user.
class api_error final : public std::exception {
public:
using status_type = httpd::reply::status_type;
using status_type = http::reply::status_type;
status_type _http_code;
std::string _type;
std::string _msg;
@@ -77,7 +77,7 @@ public:
return api_error("TableNotFoundException", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);
}
// Provide the "std::exception" interface, to make it easier to print this

View File

@@ -34,6 +34,7 @@
#include "expressions.hh"
#include "conditions.hh"
#include "cql3/constants.hh"
#include "cql3/util.hh"
#include <optional>
#include "utils/overloaded_functor.hh"
#include <seastar/json/json_elements.hh>
@@ -87,17 +88,20 @@ json::json_return_type make_streamed(rjson::value&& value) {
// move objects to coroutine frame.
auto los = std::move(os);
auto lrs = std::move(rs);
std::exception_ptr ex;
try {
co_await rjson::print(*lrs, los);
co_await los.flush();
co_await los.close();
} catch (...) {
// at this point, we cannot really do anything. HTTP headers and return code are
// already written, and quite potentially a portion of the content data.
// just log + rethrow. It is probably better the HTTP server closes connection
// abruptly or something...
elogger.error("Unhandled exception in data streaming: {}", std::current_exception());
throw;
ex = std::current_exception();
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return;
};
@@ -438,6 +442,11 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::add(table_description, "BillingModeSummary", rjson::empty_object());
rjson::add(table_description["BillingModeSummary"], "BillingMode", "PAY_PER_REQUEST");
rjson::add(table_description["BillingModeSummary"], "LastUpdateToPayPerRequestDateTime", rjson::value(creation_date_seconds));
// In PAY_PER_REQUEST billing mode, provisioned capacity should return 0
rjson::add(table_description, "ProvisionedThroughput", rjson::empty_object());
rjson::add(table_description["ProvisionedThroughput"], "ReadCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "WriteCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "NumberOfDecreasesToday", 0);
std::unordered_map<std::string,std::string> key_attribute_types;
// Add base table's KeySchema and collect types for AttributeDefinitions:
@@ -460,6 +469,11 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::add(view_entry, "IndexArn", generate_arn_for_index(*schema, index_name));
// Add indexes's KeySchema and collect types for AttributeDefinitions:
describe_key_schema(view_entry, *vptr, key_attribute_types);
// Add projection type
rjson::value projection = rjson::empty_object();
rjson::add(projection, "ProjectionType", "ALL");
// FIXME: we have to get ProjectionType from the schema when it is added
rjson::add(view_entry, "Projection", std::move(projection));
// Local secondary indexes are marked by an extra '!' sign occurring before the ':' delimiter
rjson::value& index_array = (delim_it > 1 && cf_name[delim_it-1] == '!') ? lsi_array : gsi_array;
rjson::push_back(index_array, std::move(view_entry));
@@ -750,7 +764,6 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
co_return api_error::access_denied("Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
const rjson::value* tags = rjson::find(request, "Tags");
if (!tags || !tags->IsArray()) {
co_return api_error::validation("Cannot parse tags");
@@ -758,8 +771,9 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
if (tags->Size() < 1) {
co_return api_error::validation("The number of tags must be at least 1") ;
}
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
co_await db::update_tags(_mm, schema, std::move(tags_map));
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
co_return json_string("");
}
@@ -777,9 +791,9 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
co_await db::update_tags(_mm, schema, std::move(tags_map));
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
co_return json_string("");
}
@@ -917,9 +931,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
if (!range_key.empty() && range_key != view_hash_key && range_key != view_range_key) {
add_column(view_builder, range_key, attribute_definitions, column_kind::clustering_key);
}
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_hash_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -974,9 +989,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// Note above we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
// for virtual columns when we support Projection.
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_range_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -1082,7 +1098,6 @@ future<executor::request_return_type> executor::update_table(client_state& clien
elogger.trace("Updating table {}", request);
static const std::vector<sstring> unsupported = {
"AttributeDefinitions",
"GlobalSecondaryIndexUpdates",
"ProvisionedThroughput",
"ReplicaUpdates",
@@ -1255,6 +1270,22 @@ put_or_delete_item::put_or_delete_item(const rjson::value& key, schema_ptr schem
check_key(key, schema);
}
// find_attribute() checks whether the named attribute is stored in the
// schema as a real column (we do this for key attribute, and for a GSI key)
// and if so, returns that column. If not, the function returns nullptr,
// telling the caller that the attribute is stored serialized in the
// ATTRS_COLUMN_NAME map - not in a stand-alone column in the schema.
static inline const column_definition* find_attribute(const schema& schema, const bytes& attribute_name) {
const column_definition* cdef = schema.get_column_definition(attribute_name);
// Although ATTRS_COLUMN_NAME exists as an actual column, when used as an
// attribute name it should refer to an attribute inside ATTRS_COLUMN_NAME
// not to ATTRS_COLUMN_NAME itself. This if() is needed for #5009.
if (cdef && cdef->name() == executor::ATTRS_COLUMN_NAME) {
return nullptr;
}
return cdef;
}
put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item)
: _pk(pk_from_json(item, schema)), _ck(ck_from_json(item, schema)) {
_cells = std::vector<cell>();
@@ -1262,7 +1293,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
for (auto it = item.MemberBegin(); it != item.MemberEnd(); ++it) {
bytes column_name = to_bytes(it->name.GetString());
validate_value(it->value, "PutItem");
const column_definition* cdef = schema->get_column_definition(column_name);
const column_definition* cdef = find_attribute(*schema, column_name);
if (!cdef) {
bytes value = serialize_item(it->value);
_cells->push_back({std::move(column_name), serialize_item(it->value)});
@@ -1294,7 +1325,7 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) co
auto& row = m.partition().clustered_row(*schema, _ck);
attribute_collector attrs_collector;
for (auto& c : *_cells) {
const column_definition* cdef = schema->get_column_definition(c.column_name);
const column_definition* cdef = find_attribute(*schema, c.column_name);
if (!cdef) {
attrs_collector.put(c.column_name, c.value, ts);
} else {
@@ -1359,7 +1390,8 @@ static lw_shared_ptr<query::read_command> previous_item_read_command(service::st
auto regular_columns = boost::copy_range<query::column_id_vector>(
schema->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto partition_slice = query::partition_slice(std::move(bounds), {}, std::move(regular_columns), selection->get_query_options());
return ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));
return ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice),
query::tombstone_limit(proxy.get_tombstone_limit()));
}
static dht::partition_range_vector to_partition_ranges(const schema& schema, const partition_key& pk) {
@@ -2276,7 +2308,7 @@ void executor::describe_single_item(const cql3::selection::selection& selection,
rjson::add_with_string_name(field, type_to_string((*column_it)->type), json_key_column_value(*cell, **column_it));
}
} else if (cell) {
auto deserialized = attrs_type()->deserialize(*cell, cql_serialization_format::latest());
auto deserialized = attrs_type()->deserialize(*cell);
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
@@ -2311,7 +2343,7 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
const std::optional<attrs_to_get>& attrs_to_get) {
rjson::value item = rjson::empty_object();
cql3::selection::result_set_builder builder(selection, gc_clock::now(), cql_serialization_format::latest());
cql3::selection::result_set_builder builder(selection, gc_clock::now());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
auto result_set = builder.build();
@@ -2329,21 +2361,22 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
return item;
}
std::vector<rjson::value> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get) {
cql3::selection::result_set_builder builder(selection, gc_clock::now(), cql_serialization_format::latest());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
describe_single_item(selection, result_row, attrs_to_get, item);
describe_single_item(*selection, result_row, *attrs_to_get, item);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
return ret;
co_return ret;
}
static bool check_needs_read_before_write(const parsed::value& v) {
@@ -2748,7 +2781,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
}
}
const column_definition* cdef = _schema->get_column_definition(column_name);
const column_definition* cdef = find_attribute(*_schema, column_name);
if (cdef) {
bytes column_value = get_key_from_typed_value(json_value, *cdef);
row.cells().apply(*cdef, atomic_cell::make_live(*cdef->type, ts, column_value));
@@ -2770,7 +2803,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
rjson::add_with_string_name(_return_attributes, cn, rjson::copy(*col));
}
}
const column_definition* cdef = _schema->get_column_definition(column_name);
const column_definition* cdef = find_attribute(*_schema, column_name);
if (cdef) {
row.cells().apply(*cdef, atomic_cell::make_dead(ts, gc_clock::now()));
} else {
@@ -3063,7 +3096,8 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
auto selection = cql3::selection::selection::wildcard(schema);
auto partition_slice = query::partition_slice(std::move(bounds), {}, std::move(regular_columns), selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice));
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()));
std::unordered_set<std::string> used_attribute_names;
auto attrs_to_get = calculate_attrs_to_get(request, used_attribute_names);
@@ -3217,14 +3251,14 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rs.schema->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto selection = cql3::selection::selection::wildcard(rs.schema);
auto partition_slice = query::partition_slice(std::move(bounds), {}, std::move(regular_columns), selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice));
auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()));
command->allow_limit = db::allow_per_partition_rate_limit::yes;
future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
std::vector<rjson::value> jsons = describe_multi_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
return make_ready_future<std::vector<rjson::value>>(std::move(jsons));
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get));
});
response_futures.push_back(std::move(f));
}
@@ -3480,7 +3514,7 @@ public:
rjson::add_with_string_name(field, type_to_string((*_column_it)->type), json_key_column_value(bv, **_column_it));
}
} else {
auto deserialized = attrs_type()->deserialize(bv, cql_serialization_format::latest());
auto deserialized = attrs_type()->deserialize(bv);
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
@@ -3614,11 +3648,11 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
if (exclusive_start_key) {
partition_key pk = pk_from_json(*exclusive_start_key, schema);
auto pos = position_in_partition(position_in_partition::partition_start_tag_t());
auto pos = position_in_partition::for_partition_start();
if (schema->clustering_key_size() > 0) {
pos = pos_from_json(*exclusive_start_key, schema);
}
paging_state = make_lw_shared<service::pager::paging_state>(pk, pos, query::max_partitions, utils::UUID(), service::pager::paging_state::replicas_per_token_range{}, std::nullopt, 0);
paging_state = make_lw_shared<service::pager::paging_state>(pk, pos, query::max_partitions, query_id::create_null_id(), service::pager::paging_state::replicas_per_token_range{}, std::nullopt, 0);
}
auto regular_columns = boost::copy_range<query::column_id_vector>(
@@ -3629,7 +3663,8 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
query::partition_slice::option_set opts = selection->get_query_options();
opts.add(custom_opts);
auto partition_slice = query::partition_slice(std::move(ck_bounds), std::move(static_columns), std::move(regular_columns), opts);
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice),
query::tombstone_limit(proxy.get_tombstone_limit()));
auto query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, std::move(permit));

View File

@@ -222,11 +222,11 @@ public:
const query::result&,
const std::optional<attrs_to_get>&);
static std::vector<rjson::value> describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get);
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,

View File

@@ -73,7 +73,7 @@ struct from_json_visitor {
}
// default
void operator()(const abstract_type& t) const {
bo.write(from_json_object(t, v, cql_serialization_format::internal()));
bo.write(from_json_object(t, v));
}
};
@@ -279,7 +279,7 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
}
if (ck.is_empty()) {
return position_in_partition(position_in_partition::partition_start_tag_t());
return position_in_partition::for_partition_start();
}
return position_in_partition::for_key(std::move(ck));
}

View File

@@ -16,6 +16,7 @@
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
#include "error.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/rjson.hh"
#include "auth.hh"
#include <cctype>
@@ -27,6 +28,8 @@
static logging::logger slogger("alternator-server");
using namespace httpd;
using request = http::request;
using reply = http::reply;
namespace alternator {
@@ -234,7 +237,7 @@ protected:
future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<std::string>("<unauthenticated request>");
return make_ready_future<std::string>();
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
@@ -364,7 +367,9 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
tracing::set_username(trace_state, auth::authenticated_user(username));
if (!username.empty()) {
tracing::set_username(trace_state, auth::authenticated_user(username));
}
}
return trace_state;
}
@@ -407,7 +412,11 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
executor::client_state client_state = username.empty()
? service::client_state{service::client_state::internal_tag()}
: service::client_state{service::client_state::internal_tag(), _auth_service, _sl_controller, username};
co_await client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
@@ -440,12 +449,14 @@ void server::set_routes(routes& r) {
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper)
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
, _proxy(proxy)
, _gossiper(gossiper)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}

View File

@@ -15,6 +15,7 @@
#include <seastar/net/tls.hh>
#include <optional>
#include "alternator/auth.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
@@ -26,7 +27,7 @@ using chunked_content = rjson::chunked_content;
class server {
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<request>)>;
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;
using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;
http_server _http_server;
@@ -34,6 +35,8 @@ class server {
executor& _executor;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
auth::service& _auth_service;
qos::service_level_controller& _sl_controller;
key_cache _key_cache;
bool _enforce_authorization;
@@ -65,7 +68,7 @@ class server {
json_parser _json_parser;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper);
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
@@ -73,8 +76,8 @@ public:
private:
void set_routes(seastar::httpd::routes& r);
// If verification succeeds, returns the authenticated user's username
future<std::string> verify_signature(const seastar::httpd::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
future<std::string> verify_signature(const seastar::http::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<http::request> req);
};
}

View File

@@ -74,8 +74,8 @@ struct rapidjson::internal::TypeHelper<ValueType, utils::UUID>
: public from_string_helper<ValueType, utils::UUID>
{};
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{utils::UUID_gen::unix_timestamp(uuid)};
static db_clock::time_point as_timepoint(const table_id& tid) {
return db_clock::time_point{utils::UUID_gen::unix_timestamp(tid.uuid())};
}
/**
@@ -106,6 +106,9 @@ public:
stream_arn(const UUID& uuid)
: UUID(uuid)
{}
stream_arn(const table_id& tid)
: UUID(tid.uuid())
{}
stream_arn(std::string_view v)
: UUID(v.substr(1))
{
@@ -142,20 +145,25 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
auto cfs = db.get_tables();
auto i = cfs.begin();
auto e = cfs.end();
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
// TODO: the unordered_map here is not really well suited for partial
// querying - we're sorting on local hash order, and creating a table
// between queries may or may not miss info. But that should be rare,
// and we can probably expect this to be a single call.
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id().uuid() < t2.schema()->id().uuid();
});
auto i = cfs.begin();
auto e = cfs.end();
if (streams_start) {
i = std::find_if(i, e, [&](data_dictionary::table t) {
return t.schema()->id() == streams_start
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id().uuid() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())
;
@@ -430,7 +438,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto db = _proxy.data_dictionary();
try {
auto cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
bs = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
@@ -717,7 +725,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
std::optional<shard_id> sid;
try {
auto cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
sid = rjson::get<shard_id>(request, "ShardId");
} catch (...) {
@@ -802,7 +810,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto db = _proxy.data_dictionary();
schema_ptr schema, base;
try {
auto log_table = db.find_column_family(iter.table);
auto log_table = db.find_column_family(table_id(iter.table));
schema = log_table.schema();
base = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
@@ -876,11 +884,11 @@ future<executor::request_return_type> executor::get_records(client_state& client
++mul;
}
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::row_limit(limit * mul));
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();

View File

@@ -8,6 +8,7 @@
#include <chrono>
#include <cstdint>
#include <exception>
#include <optional>
#include <seastar/core/sstring.hh>
#include <seastar/core/coroutine.hh>
@@ -17,6 +18,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include <boost/multiprecision/cpp_int.hpp>
#include "exceptions/exceptions.hh"
#include "gms/gossiper.hh"
#include "gms/inet_address.hh"
#include "inet_address_vectors.hh"
@@ -92,24 +94,25 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
co_return api_error::validation("TTL is already enabled");
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
throw api_error::validation("TTL is already enabled");
}
tags_map[TTL_TAG_KEY] = attribute_name;
} else {
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
throw api_error::validation("TTL is already disabled");
} else if (i->second != attribute_name) {
throw api_error::validation(format(
"Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",
attribute_name, i->second));
}
tags_map.erase(TTL_TAG_KEY);
}
tags_map[TTL_TAG_KEY] = attribute_name;
} else {
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
co_return api_error::validation("TTL is already disabled");
} else if (i->second != attribute_name) {
co_return api_error::validation(format(
"Requested to disable TTL on attribute {}, but a different attribute {} is enabled.",
attribute_name, i->second));
}
tags_map.erase(TTL_TAG_KEY);
}
co_await db::update_tags(_mm, schema, std::move(tags_map));
});
// Prepare the response, which contains a TimeToLiveSpecification
// basically identical to the request's
rjson::value response = rjson::empty_object();
@@ -136,7 +139,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
// Alternator tables with TTL configured via a UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -150,25 +153,25 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// To avoid scanning the same items RF times in RF replicas, only one node is
// responsible for scanning a token range at a time. Normally, this is the
// node owning this range as a "primary range" (the first node in the ring
// with this range), but when this node is down, other nodes may take over
// (FIXME: this is not implemented yet).
// with this range), but when this node is down, the secondary owner (the
// second in the ring) may take over.
// An expiration thread is reponsible for all tables which need expiration
// scans. FIXME: explain how this is done with multiple tables - parallel,
// staggered, or what?
// scans. Currently, the different tables are scanned sequentially (not in
// parallel).
// The expiration thread scans item using CL=QUORUM to ensures that it reads
// a consistent expiration-time attribute. This means that the items are read
// locally and in addition QUORUM-1 additional nodes (one additional node
// when RF=3) need to read the data and send digests.
// FIXME: explain if we can read the exact attribute or the entire map.
// When the expiration thread decides that an item has expired and wants
// to delete it, it does it using a CL=QUORUM write. This allows this
// deletion to be visible for consistent (quorum) reads. The deletion,
// like user deletions, will also appear on the CDC log and therefore
// Alternator Streams if enabled (FIXME: explain how we mark the
// deletion different from user deletes. We don't do it yet.).
expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy)
// Alternator Streams if enabled - currently as ordinary deletes (the
// userIdentity flag is currently missing this is issue #11523).
expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy, gms::gossiper& g)
: _db(db)
, _proxy(proxy)
, _gossiper(g)
{
}
@@ -282,7 +285,9 @@ static future<> expire_item(service::storage_proxy& proxy,
auto ck = clustering_key::from_exploded(exploded_ck);
m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));
}
return proxy.mutate(std::vector<mutation>{std::move(m)},
std::vector<mutation> mutations;
mutations.push_back(std::move(m));
return proxy.mutate(std::move(mutations),
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(), // FIXME - which timeout?
qs.get_trace_state(), qs.get_permit(),
@@ -365,7 +370,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
// 2. The primary replica for this token is currently marked down.
// 3. In this node, this shard is responsible for this token.
// We use the <secondary> case to handle the possibility that some of the
// nodes in the system are down. A dead node will not be expiring expiring
// nodes in the system are down. A dead node will not be expiring
// the tokens owned by it, so we want the secondary owner to take over its
// primary ranges.
//
@@ -511,7 +516,7 @@ struct scan_ranges_context {
opts.set<query::partition_slice::option::bypass_cache>();
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice));
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
@@ -546,13 +551,34 @@ static future<> scan_table_ranges(
co_return;
}
auto units = co_await get_units(page_sem, 1);
// We don't to limit page size in number of rows because there is a
// builtin limit of the page's size in bytes. Setting this limit to 1
// is useful for debugging the paging code with moderate-size data.
// We don't need to limit page size in number of rows because there is
// a builtin limit of the page's size in bytes. Setting this limit to
// 1 is useful for debugging the paging code with moderate-size data.
uint32_t limit = std::numeric_limits<uint32_t>::max();
// FIXME: which timeout?
// FIXME: if read times out, need to retry it.
std::unique_ptr<cql3::result_set> rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
// Read a page, and if that times out, try again after a small sleep.
// If we didn't catch the timeout exception, it would cause the scan
// be aborted and only be restarted at the next scanning period.
// If we retry too many times, give up and restart the scan later.
std::unique_ptr<cql3::result_set> rs;
for (int retries=0; ; retries++) {
try {
// FIXME: which timeout?
rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
break;
} catch(exceptions::read_timeout_exception&) {
tlogger.warn("expiration scanner read timed out, will retry: {}",
std::current_exception());
}
// If we didn't break out of this loop, add a minimal sleep
if (retries >= 10) {
// Don't get stuck forever asking the same page, maybe there's
// a bug or a real problem in several replicas. Give up on
// this scan an retry the scan from a random position later,
// in the next scan period.
throw runtime_exception("scanner thread failed after too many timeouts for the same page");
}
co_await sleep_abortable(std::chrono::seconds(1), abort_source);
}
auto rows = rs->rows();
auto meta = rs->get_metadata().get_names();
std::optional<unsigned> expiration_column;
@@ -637,6 +663,7 @@ static future<> scan_table_ranges(
static future<bool> scan_table(
service::storage_proxy& proxy,
data_dictionary::database db,
gms::gossiper& gossiper,
schema_ptr s,
abort_source& abort_source,
named_semaphore& page_sem,
@@ -689,7 +716,7 @@ static future<bool> scan_table(
expiration_stats.scan_table++;
// FIXME: need to pace the scan, not do it all at once.
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), proxy.gossiper(), s);
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), gossiper, s);
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
@@ -709,7 +736,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), proxy.gossiper(), s);
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), gossiper, s);
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;
@@ -741,7 +768,7 @@ future<> expiration_service::run() {
co_return;
}
try {
co_await scan_table(_proxy, _db, s, _abort_source, _page_sem, _expiration_stats);
co_await scan_table(_proxy, _db, _gossiper, s, _abort_source, _page_sem, _expiration_stats);
} catch (...) {
// The scan of a table may fail in the middle for many
// reasons, including network failure and even the table
@@ -767,13 +794,15 @@ future<> expiration_service::run() {
// in the next iteration by reducing the scanner's scheduling-group
// share (if using a separate scheduling group), or introduce
// finer-grain sleeps into the scanning code.
std::chrono::seconds scan_duration(std::chrono::duration_cast<std::chrono::seconds>(lowres_clock::now() - start));
std::chrono::seconds period(_db.get_config().alternator_ttl_period_in_seconds());
std::chrono::milliseconds scan_duration(std::chrono::duration_cast<std::chrono::milliseconds>(lowres_clock::now() - start));
std::chrono::milliseconds period(long(_db.get_config().alternator_ttl_period_in_seconds() * 1000));
if (scan_duration < period) {
try {
tlogger.info("sleeping {} seconds until next period", (period - scan_duration).count());
tlogger.info("sleeping {} seconds until next period", (period - scan_duration).count()/1000.0);
co_await seastar::sleep_abortable(period - scan_duration, _abort_source);
} catch(seastar::sleep_aborted&) {}
} else {
tlogger.warn("scan took {} seconds, longer than period - not sleeping", scan_duration.count()/1000.0);
}
}
}

View File

@@ -14,6 +14,10 @@
#include <seastar/core/semaphore.hh>
#include "data_dictionary/data_dictionary.hh"
namespace gms {
class gossiper;
}
namespace replica {
class database;
}
@@ -47,6 +51,7 @@ public:
private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
@@ -60,7 +65,7 @@ public:
// sharded_service<expiration_service>::start() creates this object on
// all shards, so calls this constructor on each shard. Later, the
// additional start() function should be invoked on all shards.
expiration_service(data_dictionary::database, service::storage_proxy&);
expiration_service(data_dictionary::database, service::storage_proxy&, gms::gossiper&);
future<> start();
future<> run();
// sharded_service<expiration_service>::stop() calls the following stop()

15
amplify.yml Normal file
View File

@@ -0,0 +1,15 @@
version: 1
applications:
- frontend:
phases:
build:
commands:
- make setupenv
- make dirhtml
artifacts:
baseDirectory: _build/dirhtml
files:
- '**/*'
cache:
paths: []
appRoot: docs

43
api/api-doc/raft.json Normal file
View File

@@ -0,0 +1,43 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/raft",
"produces":[
"application/json"
],
"apis":[
{
"path":"/raft/trigger_snapshot/{group_id}",
"operations":[
{
"method":"POST",
"summary":"Triggers snapshot creation and log truncation for the given Raft group",
"type":"string",
"nickname":"trigger_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"group_id",
"description":"The ID of the group which should get snapshotted",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -1228,7 +1228,7 @@
"operations":[
{
"method":"POST",
"summary":"Removes token (and all data associated with enpoint that had it) from the ring",
"summary":"Removes a node from the cluster. Replicated data that logically belonged to this node is redistributed among the remaining nodes.",
"type":"void",
"nickname":"remove_node",
"produces":[
@@ -1245,7 +1245,7 @@
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"description":"Comma-separated list of dead nodes to ignore in removenode operation. Use the same method for all nodes to ignore: either Host IDs or ip addresses.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1946,7 +1946,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",
"type":"void",
"nickname":"reset_local_schema",
"produces":[

View File

@@ -52,6 +52,45 @@
}
]
},
{
"path":"/system/log",
"operations":[
{
"method":"POST",
"summary":"Write a message to the Scylla log",
"type":"void",
"nickname":"write_log_message",
"produces":[
"application/json"
],
"parameters":[
{
"name":"message",
"description":"The message to write to the log",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"level",
"description":"The logging level to use",
"required":true,
"allowMultiple":false,
"type":"string",
"enum":[
"error",
"warn",
"info",
"debug",
"trace"
],
"paramType":"query"
}
]
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[

View File

@@ -0,0 +1,305 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager/list_modules",
"operations":[
{
"method":"GET",
"summary":"Get all modules names",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_modules",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/list_module_tasks/{module}",
"operations":[
{
"method":"GET",
"summary":"Get a list of tasks",
"type":"array",
"items":{
"type":"task_stats"
},
"nickname":"get_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"internal",
"description":"Boolean flag indicating whether internal tasks should be shown (false by default)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager/task_status/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get task status",
"type":"task_status",
"nickname":"get_task_status",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/abort_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Abort running task and its descendants",
"type":"void",
"nickname":"abort_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to abort",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/wait_task/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Wait for a task to complete",
"type":"task_status",
"nickname":"wait_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to wait for",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/task_status_recursive/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get statuses of the task and all its descendants",
"type":"array",
"items":{
"type":"task_status"
},
"nickname":"get_task_status_recursively",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
"task_stats" :{
"id": "task_stats",
"description":"A task statistics object",
"properties":{
"task_id":{
"type":"string",
"description":"The uuid of a task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
],
"description":"The state of a task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"keyspace":{
"type":"string",
"description":"The keyspace the task is working on (if applicable)"
},
"table":{
"type":"string",
"description":"The table the task is working on (if applicable)"
},
"entity":{
"type":"string",
"description":"Task-specific entity description"
},
"sequence_number":{
"type":"long",
"description":"The running sequence number of the task"
}
}
},
"task_status":{
"id":"task_status",
"description":"A task status object",
"properties":{
"id":{
"type":"string",
"description":"The uuid of the task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
],
"description":"The state of the task"
},
"is_abortable":{
"type":"boolean",
"description":"Boolean flag indicating whether the task can be aborted"
},
"start_time":{
"type":"datetime",
"description":"The start time of the task"
},
"end_time":{
"type":"datetime",
"description":"The end time of the task (unspecified when the task is not completed)"
},
"error":{
"type":"string",
"description":"Error string, if the task failed"
},
"parent_id":{
"type":"string",
"description":"The uuid of the parent task"
},
"sequence_number":{
"type":"long",
"description":"The running sequence number of the task"
},
"shard":{
"type":"long",
"description":"The number of a shard the task is running on"
},
"keyspace":{
"type":"string",
"description":"The keyspace the task is working on (if applicable)"
},
"table":{
"type":"string",
"description":"The table the task is working on (if applicable)"
},
"entity":{
"type":"string",
"description":"Task-specific entity description"
},
"progress_units":{
"type":"string",
"description":"A description of the progress units"
},
"progress_total":{
"type":"double",
"description":"The total number of units to complete for the task"
},
"progress_completed":{
"type":"double",
"description":"The number of units completed so far"
},
"children_ids":{
"type":"array",
"items":{
"type":"string"
},
"description":"Task IDs of children of this task"
}
}
}
}
}

View File

@@ -0,0 +1,177 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager_test",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager_test/test_module",
"operations":[
{
"method":"POST",
"summary":"Register test module in task manager",
"type":"void",
"nickname":"register_test_module",
"produces":[
"application/json"
],
"parameters":[
]
},
{
"method":"DELETE",
"summary":"Unregister test module in task manager",
"type":"void",
"nickname":"unregister_test_module",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager_test/test_task",
"operations":[
{
"method":"POST",
"summary":"Register test task",
"type":"string",
"nickname":"register_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"shard",
"description":"The shard of the task",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"parent_id",
"description":"The uuid of a parent task",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"entity",
"description":"Task-specific entity description",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Unregister test task",
"type":"void",
"nickname":"unregister_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/finish_test_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Finish test task",
"type":"void",
"nickname":"finish_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to finish",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"error",
"description":"The error with which task fails (if it does)",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/ttl",
"operations":[
{
"method":"POST",
"summary":"Set ttl in seconds and get last value",
"type":"long",
"nickname":"get_and_update_ttl",
"produces":[
"application/json"
],
"parameters":[
{
"name":"ttl",
"description":"The number of seconds for which the tasks will be kept in memory after it finishes",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -29,6 +29,9 @@
#include "stream_manager.hh"
#include "system.hh"
#include "api/config.hh"
#include "task_manager.hh"
#include "task_manager_test.hh"
#include "raft.hh"
logging::logger apilog("api");
@@ -146,8 +149,14 @@ future<> unset_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });
}
future<> set_server_snitch(http_context& ctx) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", [&snitch] (http_context& ctx, routes& r) {
set_endpoint_snitch(ctx, r, snitch);
});
}
future<> unset_server_snitch(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_endpoint_snitch(ctx, r); });
}
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g) {
@@ -245,6 +254,42 @@ future<> set_server_done(http_context& ctx) {
});
}
future<> set_server_task_manager(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
rb->register_function(r, "task_manager",
"The task manager API");
set_task_manager(ctx, r);
});
}
#ifndef SCYLLA_BUILD_MODE_RELEASE
future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &cfg = *cfg](routes& r) mutable {
rb->register_function(r, "task_manager_test",
"The task manager test API");
set_task_manager_test(ctx, r, cfg);
});
}
#endif
future<> set_server_raft(http_context& ctx, sharded<service::raft_group_registry>& raft_gr) {
auto rb = std::make_shared<api_registry_builder>(ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &raft_gr] (routes& r) {
rb->register_function(r, "raft", "The Raft API");
set_raft(ctx, r, raft_gr);
});
}
future<> unset_server_raft(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });
}
void req_params::process(const request& req) {
// Process mandatory parameters
for (auto& [name, ent] : params) {

View File

@@ -137,6 +137,14 @@ future<json::json_return_type> sum_timer_stats(distributed<T>& d, utils::timed_
});
}
template<class T, class F>
future<json::json_return_type> sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),
std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {
return make_ready_future<json::json_return_type>(timer_to_json(val));
});
}
inline int64_t min_int64(int64_t a, int64_t b) {
return std::min(a,b);
}

View File

@@ -11,13 +11,18 @@
#include <seastar/core/future.hh>
#include "replica/database_fwd.hh"
#include "tasks/task_manager.hh"
#include "seastarx.hh"
using request = http::request;
using reply = http::reply;
namespace service {
class load_meter;
class storage_proxy;
class storage_service;
class raft_group_registry;
} // namespace service
@@ -31,6 +36,7 @@ namespace locator {
class token_metadata;
class shared_token_metadata;
class snitch_ptr;
} // namespace locator
@@ -66,11 +72,12 @@ struct http_context {
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
const sharded<locator::shared_token_metadata>& shared_token_metadata;
sharded<tasks::task_manager>& tm;
http_context(distributed<replica::database>& _db,
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm) {
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm, sharded<tasks::task_manager>& _tm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm), tm(_tm) {
}
const locator::token_metadata& get_token_metadata();
@@ -78,7 +85,8 @@ struct http_context {
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx, const db::config& cfg);
future<> set_server_snitch(http_context& ctx);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
@@ -107,5 +115,9 @@ future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx);
future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg);
future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);
future<> unset_server_raft(http_context&);
}

View File

@@ -14,7 +14,7 @@
#include "sstables/metadata_collector.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
#include "unimplemented.hh"
@@ -43,7 +43,7 @@ std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {
return std::make_tuple(name.substr(0, pos), name.substr(end));
}
const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
const table_id& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
try {
return db.find_uuid(ks, cf);
} catch (replica::no_such_column_family& e) {
@@ -51,7 +51,7 @@ const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const replica:
}
}
const utils::UUID& get_uuid(const sstring& name, const replica::database& db) {
const table_id& get_uuid(const sstring& name, const replica::database& db) {
auto [ks, cf] = parse_fully_qualified_cf_name(name);
return get_uuid(ks, cf, db);
}
@@ -110,7 +110,7 @@ static future<json::json_return_type> get_cf_stats_count(http_context& ctx,
static future<json::json_return_type> get_cf_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).hist;},
utils::ihistogram(),
@@ -122,7 +122,7 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, const
static future<json::json_return_type> get_cf_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).hist;},
utils::ihistogram(),
@@ -149,7 +149,7 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).rate();},
utils::rate_moving_average_and_histogram(),
@@ -334,13 +334,13 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](replica::column_family& cf) {
return cf.active_memtable().partition_count();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));
}, std::plus<>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t{0}, [](replica::column_family& cf) {
return cf.active_memtable().partition_count();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));
}, std::plus<>());
});
@@ -354,25 +354,33 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().total_space();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().total_space();
}), uint64_t(0));
}, std::plus<int64_t>());
});
cf::get_all_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().total_space();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().total_space();
}), uint64_t(0));
}, std::plus<int64_t>());
});
cf::get_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
}, std::plus<int64_t>());
});
cf::get_all_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
}, std::plus<int64_t>());
});
@@ -394,7 +402,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return ctx.db.map_reduce0([](const replica::database& db){
return db.dirty_memory_region_group().memory_used();
return db.dirty_memory_region_group().real_memory_used();
}, int64_t(0), std::plus<int64_t>()).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -410,7 +418,9 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.active_memtable().region().occupancy().used_space();
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
}, std::plus<int64_t>());
});
@@ -529,13 +539,13 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
return cf.estimate_pending_compactions();
}, std::plus<int64_t>());
});
cf::get_all_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
return cf.estimate_pending_compactions();
}, std::plus<int64_t>());
});
@@ -855,7 +865,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_auto_compaction.set(r, [&ctx] (const_req req) {
const utils::UUID& uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.param["name"], ctx.db.local());
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
return !cf.is_auto_compaction_disabled_by_user();
});

View File

@@ -18,7 +18,7 @@ namespace api {
void set_column_family(http_context& ctx, routes& r);
const utils::UUID& get_uuid(const sstring& name, const replica::database& db);
const table_id& get_uuid(const sstring& name, const replica::database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);
@@ -63,7 +63,7 @@ struct map_reduce_column_families_locally {
std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;
future<std::unique_ptr<std::any>> operator()(replica::database& db) const {
auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));
return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
return do_for_each(db.get_column_families(), [res, this](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) {
*res = reducer(std::move(*res), mapper(*i.second.get()));
}).then([res] {
return std::move(*res);

View File

@@ -41,7 +41,6 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha
return std::move(a);
}
void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([](replica::database& db) {
@@ -68,9 +67,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([&ctx](replica::database& db) {
return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&ctx, &db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) -> future<> {
replica::table& cf = *i.second.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.estimate_pending_compactions();
return make_ready_future<>();
}).then([&tasks] {
return std::move(tasks);
@@ -119,7 +118,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {
auto& cm = db.get_compaction_manager();
return parallel_for_each(table_names, [&db, &cm, &ks_name, type] (sstring& table_name) {
auto& t = db.find_column_family(ks_name, table_name);
return cm.stop_compaction(type, &t.as_table_state());
return t.parallel_foreach_table_state([&] (compaction::table_state& ts) {
return cm.stop_compaction(type, &ts);
});
});
});
co_return json_void();
@@ -127,7 +128,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {
return cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
return cf.estimate_pending_compactions();
}, std::plus<int64_t>());
});

View File

@@ -8,13 +8,15 @@
#include "locator/token_metadata.hh"
#include "locator/snitch_base.hh"
#include "locator/production_snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>& snitch) {
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
@@ -22,17 +24,45 @@ void set_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [&ctx](const_req req) {
auto& topology = ctx.shared_token_metadata.local().get()->get_topology();
return topology.get_datacenter(host_or_broadcast(req));
auto ep = host_or_broadcast(req);
if (!topology.has_endpoint(ep)) {
// Cannot return error here, nodetool status can race, request
// info about just-left node and not handle it nicely
return sstring(locator::production_snitch_base::default_dc);
}
return topology.get_datacenter(ep);
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [&ctx](const_req req) {
auto& topology = ctx.shared_token_metadata.local().get()->get_topology();
return topology.get_rack(host_or_broadcast(req));
auto ep = host_or_broadcast(req);
if (!topology.has_endpoint(ep)) {
// Cannot return error here, nodetool status can race, request
// info about just-left node and not handle it nicely
return sstring(locator::production_snitch_base::default_rack);
}
return topology.get_rack(ep);
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_name();
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [&snitch] (const_req req) {
return snitch.local()->get_name();
});
httpd::storage_service_json::update_snitch.set(r, [&snitch](std::unique_ptr<request> req) {
locator::snitch_config cfg;
cfg.name = req->get_query_param("ep_snitch_class_name");
return locator::i_endpoint_snitch::reset_snitch(snitch, cfg).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
});
}
void unset_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.unset(r);
httpd::endpoint_snitch_info_json::get_rack.unset(r);
httpd::endpoint_snitch_info_json::get_snitch_name.unset(r);
httpd::storage_service_json::update_snitch.unset(r);
}
}

View File

@@ -10,8 +10,13 @@
#include "api.hh"
namespace locator {
class snitch_ptr;
}
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r);
void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>&);
void unset_endpoint_snitch(http_context& ctx, routes& r);
}

View File

@@ -17,36 +17,42 @@ namespace fd = httpd::failure_detector_json;
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {
std::vector<fd::endpoint_state> res;
for (auto i : g.get_endpoint_states()) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
val.generation = i.second.get_heart_beat_state().get_generation();
val.version = i.second.get_heart_beat_state().get_heart_beat_version();
val.update_time = i.second.get_update_timestamp().time_since_epoch().count();
for (auto a : i.second.get_application_state_map()) {
fd::version_value version_val;
// We return the enum index and not it's name to stay compatible to origin
// method that the state index are static but the name can be changed.
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);
version_val.value = a.second.value;
version_val.version = a.second.version;
val.application_state.push(version_val);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::vector<fd::endpoint_state> res;
for (auto i : g.get_endpoint_states()) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
val.generation = i.second.get_heart_beat_state().get_generation();
val.version = i.second.get_heart_beat_state().get_heart_beat_version();
val.update_time = i.second.get_update_timestamp().time_since_epoch().count();
for (auto a : i.second.get_application_state_map()) {
fd::version_value version_val;
// We return the enum index and not it's name to stay compatible to origin
// method that the state index are static but the name can be changed.
version_val.application_state = static_cast<std::underlying_type<gms::application_state>::type>(a.first);
version_val.value = a.second.value;
version_val.version = a.second.version;
val.application_state.push(version_val);
}
res.push_back(val);
}
res.push_back(val);
}
return make_ready_future<json::json_return_type>(res);
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
int res = g.get_up_endpoint_count();
return make_ready_future<json::json_return_type>(res);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
int res = g.get_up_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
int res = g.get_down_endpoint_count();
return make_ready_future<json::json_return_type>(res);
return g.container().invoke_on(0, [] (gms::gossiper& g) {
int res = g.get_down_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
});
fd::get_phi_convict_threshold.set(r, [] (std::unique_ptr<request> req) {
@@ -54,11 +60,13 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
});
fd::set_phi_convict_threshold.set(r, [](std::unique_ptr<request> req) {
@@ -67,13 +75,15 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);
return make_ready_future<json::json_return_type>(sstring(ss.str()));
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);
return make_ready_future<json::json_return_type>(sstring(ss.str()));
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {

View File

@@ -6,6 +6,8 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "gossiper.hh"
#include "api/api-doc/gossiper.json.hh"
#include "gms/gossiper.hh"
@@ -14,19 +16,23 @@ namespace api {
using namespace json;
void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_unreachable_members();
return container_to_vec(res);
httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto res = co_await g.get_unreachable_members_synchronized();
co_return json::json_return_type(container_to_vec(res));
});
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (const_req req) {
auto res = g.get_live_members();
return container_to_vec(res);
httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {
return g.get_live_members_synchronized().then([] (auto res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (const_req req) {
gms::inet_address ep(req.param["addr"]);
return g.get_endpoint_downtime(ep);
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->param["addr"]);
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<request> req) {

70
api/raft.cc Normal file
View File

@@ -0,0 +1,70 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "api/api.hh"
#include "api/api-doc/raft.json.hh"
#include "service/raft/raft_group_registry.hh"
using namespace seastar::httpd;
extern logging::logger apilog;
namespace api {
namespace r = httpd::raft_json;
using namespace json;
void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr) {
r::trigger_snapshot.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
raft::group_id gid{utils::UUID{req->param["group_id"]}};
auto timeout_dur = std::invoke([timeout_str = req->get_query_param("timeout")] {
if (timeout_str.empty()) {
return std::chrono::seconds{60};
}
auto dur = std::stoll(timeout_str);
if (dur <= 0) {
throw std::runtime_error{"Timeout must be a positive number."};
}
return std::chrono::seconds{dur};
});
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout_dur, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
auto* srv = raft_gr.find_server(gid);
if (!srv) {
co_return;
}
found_srv = true;
abort_on_expiry aoe(lowres_clock::now() + timeout_dur);
apilog.info("Triggering Raft group {} snapshot", gid);
auto result = co_await srv->trigger_snapshot(&aoe.abort_source());
if (result) {
apilog.info("New snapshot for Raft group {} created", gid);
} else {
apilog.info("Could not create new snapshot for Raft group {}, no new entries applied", gid);
}
});
if (!found_srv) {
throw std::runtime_error{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_void{};
});
}
void unset_raft(http_context&, httpd::routes& r) {
r::trigger_snapshot.unset(r);
}
}

18
api/raft.hh Normal file
View File

@@ -0,0 +1,18 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api_init.hh"
namespace api {
void set_raft(http_context& ctx, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr);
void unset_raft(http_context& ctx, httpd::routes& r);
}

View File

@@ -22,6 +22,9 @@ namespace sp = httpd::storage_proxy_json;
using proxy = service::storage_proxy;
using namespace json;
utils::time_estimated_histogram timed_rate_moving_average_summary_merge(utils::time_estimated_histogram a, const utils::timed_rate_moving_average_summary_and_histogram& b) {
return a.merge(b.histogram());
}
/**
* This function implement a two dimentional map reduce where
@@ -55,10 +58,10 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
* @param initial_value - the initial value to use for both aggregations* @return
* @return A future that resolves to the result of the aggregation.
*/
template<typename V, typename Reducer, typename F>
template<typename V, typename Reducer, typename F, typename C>
future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
V F::*f, Reducer reducer, V initial_value) {
return two_dimensional_map_reduce(d, [f] (F& stats) {
C F::*f, Reducer reducer, V initial_value) {
return two_dimensional_map_reduce(d, [f] (F& stats) -> V {
return stats.*f;
}, reducer, initial_value);
}
@@ -112,10 +115,10 @@ utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimat
return res;
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::time_estimated_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, f, utils::time_estimated_histogram_merge,
utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).histogram();
}, utils::time_estimated_histogram_merge, utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(val));
});
}
@@ -130,7 +133,7 @@ static future<json::json_return_type> sum_estimated_histogram(http_context& ctx
});
}
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram service::storage_proxy_stats::stats::*f) {
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist.mean * (stats.*f).hist.count;
}, std::plus<double>(), 0.0).then([](double val) {
@@ -150,7 +153,7 @@ static future<json::json_return_type> total_latency(http_context& ctx, utils::t
template<typename F>
future<json::json_return_type>
sum_histogram_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist;
}, std::plus<utils::ihistogram>(), utils::ihistogram()).
@@ -170,7 +173,7 @@ sum_histogram_stats_storage_proxy(distributed<proxy>& d,
template<typename F>
future<json::json_return_type>
sum_timer_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).rate();
@@ -491,14 +494,14 @@ void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_se
});
sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_read);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::read);
});
sp::get_read_latency.set(r, [&ctx](std::unique_ptr<request> req) {
return total_latency(ctx, &service::storage_proxy_stats::stats::read);
});
sp::get_write_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_write);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::write);
});
sp::get_write_latency.set(r, [&ctx](std::unique_ptr<request> req) {

View File

@@ -49,6 +49,14 @@
extern logging::logger apilog;
namespace std {
std::ostream& operator<<(std::ostream& os, const api::table_info& ti) {
return os << "table{name=" << ti.name << ", id=" << ti.id << "}";
}
} // namespace std
namespace api {
const locator::token_metadata& http_context::get_token_metadata() {
@@ -69,6 +77,11 @@ sstring validate_keyspace(http_context& ctx, const parameters& param) {
return validate_keyspace(ctx, param["keyspace"]);
}
locator::host_id validate_host_id(const sstring& param) {
auto hoep = locator::host_id_or_endpoint(param, locator::host_id_or_endpoint::param_type::host_id);
return hoep.id;
}
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
@@ -95,6 +108,55 @@ std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, con
return parse_tables(ks_name, ctx, it->second);
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, sstring value) {
std::vector<table_info> res;
try {
if (value.empty()) {
const auto& cf_meta_data = ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data();
res.reserve(cf_meta_data.size());
for (const auto& [name, schema] : cf_meta_data) {
res.emplace_back(table_info{name, schema->id()});
}
} else {
std::vector<sstring> names = split(value, ",");
res.reserve(names.size());
const auto& db = ctx.db.local();
for (const auto& table_name : names) {
res.emplace_back(table_info{table_name, db.find_uuid(ks_name, table_name)});
}
}
} catch (const replica::no_such_keyspace& e) {
throw bad_param_exception(e.what());
} catch (const replica::no_such_column_family& e) {
throw bad_param_exception(e.what());
}
return res;
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
return parse_table_infos(ks_name, ctx, it != query_params.end() ? it->second : "");
}
// Run on all tables, skipping dropped tables
future<> run_on_existing_tables(sstring op, replica::database& db, std::string_view keyspace, const std::vector<table_info> local_tables, std::function<future<> (replica::table&)> func) {
std::exception_ptr ex;
for (const auto& ti : local_tables) {
apilog.debug("Starting {} on {}.{}", op, keyspace, ti);
try {
co_await func(db.find_column_family(ti.id));
} catch (const replica::no_such_column_family& e) {
apilog.warn("Skipping {} of {}.{}: {}", op, keyspace, ti, e.what());
} catch (...) {
ex = std::current_exception();
apilog.error("Failed {} of {}.{}: {}", op, keyspace, ti, ex);
}
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
}
}
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
ss::token_range r;
r.start_token = d._start_token;
@@ -113,16 +175,13 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<sstring>)>;
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<request>, sstring, std::vector<table_info>)>;
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return f(ctx, std::move(req), std::move(keyspace), std::move(column_families));
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
@@ -184,17 +243,21 @@ future<json::json_return_type> set_tables_autocompaction(http_context& ctx, cons
}
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl) {
ss::start_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
ss::start_native_transport.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.start_server();
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] {
return ctl.start_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
ss::stop_native_transport.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.request_stop_server();
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] {
return ctl.request_stop_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -216,17 +279,21 @@ void unset_transport_controller(http_context& ctx, routes& r) {
}
void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl) {
ss::stop_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.request_stop_server();
ss::stop_rpc_server.set(r, [&ctx, &ctl] (std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx, &ctl] {
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] () mutable {
return ctl.request_stop_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&] {
return ctl.start_server();
ss::start_rpc_server.set(r, [&ctx, &ctl](std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx, &ctl] {
return with_scheduling_group(ctx.db.local().get_statement_scheduling_group(), [&ctl] () mutable {
return ctl.start_server();
});
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -548,7 +615,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::describe_any_ring.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
// Find an arbitrary non-system keyspace.
auto keyspaces = ctx.db.local().get_non_system_keyspaces();
auto keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();
if (keyspaces.empty()) {
throw std::runtime_error("No keyspace provided and no non system kespace exist");
}
@@ -604,104 +671,126 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& cf_name) {
return db.find_uuid(keyspace, cf_name);
}));
// major compact smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.debug("force_keyspace_compaction: keyspace={} tables={}", keyspace, table_infos);
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto local_tables = table_infos;
// major compact smaller tables first, to increase chances of success if low on space.
std::ranges::sort(local_tables, std::less<>(), [&] (const table_info& ti) {
try {
return db.find_column_family(ti.id).get_stats().live_disk_space_used;
} catch (const replica::no_such_column_family& e) {
return int64_t(-1);
}
});
co_await run_on_existing_tables("force_keyspace_compaction", db, keyspace, local_tables, [] (replica::table& t) {
return t.compact_all_sstables();
});
});
// as a table can be dropped during loop below, let's find it before issuing major compaction request.
for (auto& id : table_ids) {
co_await db.find_column_family(id).compact_all_sstables();
}
co_return;
}).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
} catch (...) {
apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json_void();
});
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
return ss.local().is_cleanup_allowed(keyspace).then([&ctx, keyspace,
column_families = std::move(column_families)] (bool is_cleanup_allowed) mutable {
if (!is_cleanup_allowed) {
return make_exception_future<json::json_return_type>(
std::runtime_error("Can not perform cleanup operation when topology changes"));
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& table_name) {
return db.find_uuid(keyspace, table_name);
}));
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto local_tables = table_infos;
// cleanup smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
std::ranges::sort(local_tables, std::less<>(), [&] (const table_info& ti) {
try {
return db.find_column_family(ti.id).get_stats().live_disk_space_used;
} catch (const replica::no_such_column_family& e) {
return int64_t(-1);
}
});
auto& cm = db.get_compaction_manager();
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
// as a table can be dropped during loop below, let's find it before issuing the cleanup request.
for (auto& id : table_ids) {
replica::table& t = db.find_column_family(id);
co_await cm.perform_cleanup(owned_ranges_ptr, t.as_table_state());
}
co_return;
}).then([]{
return make_ready_future<json::json_return_type>(0);
co_await run_on_existing_tables("force_keyspace_cleanup", db, keyspace, local_tables, [&] (replica::table& t) {
return t.perform_cleanup_compaction(owned_ranges_ptr);
});
});
});
} catch (...) {
apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(0);
});
ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> tables) -> future<json::json_return_type> {
co_return co_await ctx.db.map_reduce0([&keyspace, &tables] (replica::database& db) -> future<bool> {
bool needed = false;
for (const auto& table : tables) {
auto& t = db.find_column_family(keyspace, table);
needed |= co_await t.perform_offstrategy_compaction();
}
co_return needed;
}, false, std::plus<bool>());
ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);
bool res = false;
try {
res = co_await ctx.db.map_reduce0([&] (replica::database& db) -> future<bool> {
bool needed = false;
co_await run_on_existing_tables("perform_keyspace_offstrategy_compaction", db, keyspace, table_infos, [&needed] (replica::table& t) -> future<> {
needed |= co_await t.perform_offstrategy_compaction();
});
co_return needed;
}, false, std::plus<bool>());
} catch (...) {
apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(res);
}));
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
return ctx.db.invoke_on_all([=] (replica::database& db) {
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_upgrade(owned_ranges_ptr, cf.as_table_state(), exclude_current_version);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
try {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
co_await run_on_existing_tables("upgrade_sstables", db, keyspace, table_infos, [&] (replica::table& t) {
return t.parallel_foreach_table_state([&] (compaction::table_state& ts) {
return t.get_compaction_manager().perform_sstable_upgrade(owned_ranges_ptr, ts, exclude_current_version);
});
});
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
});
} catch (...) {
apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_return json::json_return_type(0);
}));
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
auto &db = ctx.db.local();
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto& db = ctx.db;
if (column_families.empty()) {
co_await db.flush_on_all(keyspace);
co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);
} else {
co_await db.flush_on_all(keyspace, std::move(column_families));
co_await replica::database::flush_tables_on_all_shards(db, keyspace, std::move(column_families));
}
co_return json_void();
});
ss::decommission.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("decommission");
return ss.local().decommission().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -715,20 +804,24 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::remove_node.set(r, [&ss](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
auto ignore_nodes = std::list<gms::inet_address>();
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
auto hoep = locator::host_id_or_endpoint(n);
if (!ignore_nodes.empty() && hoep.has_host_id() != ignore_nodes.front().has_host_id()) {
throw std::runtime_error("All nodes should be identified using the same method: either Host IDs or ip addresses.");
}
ignore_nodes.push_back(std::move(hoep));
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}: {}", ignore_nodes_strs, n, std::current_exception()));
}
}
return ss.local().removenode(host_id, std::move(ignore_nodes)).then([] {
@@ -789,6 +882,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::drain.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("drain");
return ss.local().drain().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -806,28 +900,20 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
if (type == "user") {
return ctx.db.local().get_user_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
return ctx.db.local().get_non_local_strategy_keyspaces();
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
locator::snitch_config cfg;
cfg.name = req->get_query_param("ep_snitch_class_name");
return locator::i_endpoint_snitch::reset_snitch(cfg).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("stop_gossiping");
return ss.local().stop_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("start_gossiping");
return ss.local().start_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -930,6 +1016,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::rebuild.set(r, [&ss](std::unique_ptr<request> req) {
auto source_dc = req->get_query_param("source_dc");
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -962,17 +1049,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(res);
});
ss::reset_local_schema.set(r, [&sys_ks](std::unique_ptr<request> req) {
ss::reset_local_schema.set(r, [&ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& sp = service::get_storage_proxy();
auto& fs = sp.local().features();
return db::schema_tables::recalculate_schema_version(sys_ks, sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
apilog.info("reset_local_schema");
co_await ss.local().reload_schema();
co_return json_void();
});
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
auto probability = req->get_query_param("probability");
apilog.info("set_trace_probability: probability={}", probability);
return futurize_invoke([probability] {
double real_prob = std::stod(probability.c_str());
return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {
@@ -1010,6 +1096,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
apilog.info("set_slow_query: enable={} ttl={} threshold={} fast={}", enable, ttl, threshold, fast);
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
if (threshold != "") {
@@ -1036,6 +1123,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
@@ -1043,6 +1131,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
@@ -1314,7 +1403,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
ss::take_snapshot.set(r, [&ctx, &snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
apilog.info("take_snapshot: {}", req->query_parameters);
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
@@ -1332,7 +1421,13 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
for (const auto& table_name : column_families) {
auto& t = ctx.db.local().find_column_family(keynames[0], table_name);
if (t.schema()->is_view()) {
throw std::invalid_argument("Do not take a snapshot of a materialized view or a secondary index by itself. Run snapshot on the base table instead.");
}
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, db::snapshot_ctl::snap_views::yes, sf);
}
co_return json_void();
} catch (...) {
@@ -1362,7 +1457,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<request> req) {
ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto rp = req_params({
{"keyspace", {mandatory::yes}},
{"cf", {""}},
@@ -1398,11 +1494,13 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
}
}
auto f = make_ready_future<>();
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no, db::snapshot_ctl::allow_view_snapshots::yes);
co_await coroutine::parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
// We always pass here db::snapshot_ctl::snap_views::no since:
// 1. When scrubbing particular tables, there's no need to auto-snapshot their views.
// 2. When scrubbing the whole keyspace, column_families will contain both base tables and views.
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::snap_views::no, db::snapshot_ctl::skip_flush::no);
});
}
@@ -1427,28 +1525,30 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
return stats;
};
return f.then([&ctx, keyspace, column_families, opts, &reduce_compaction_stats] {
return ctx.db.map_reduce0([=] (replica::database& db) {
return map_reduce(column_families, [=, &db] (sstring cfname) {
try {
auto opt_stats = co_await db.map_reduce0([&] (replica::database& db) {
return map_reduce(column_families, [&] (sstring cfname) -> future<std::optional<sstables::compaction_stats>> {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_scrub(cf.as_table_state(), opts);
sstables::compaction_stats stats{};
co_await cf.parallel_foreach_table_state([&] (compaction::table_state& ts) mutable -> future<> {
auto r = co_await cm.perform_sstable_scrub(ts, opts);
stats += r.value_or(sstables::compaction_stats{});
});
co_return stats;
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}).then_wrapped([] (auto f) {
if (f.failed()) {
auto ex = f.get_exception();
if (try_catch<sstables::compaction_aborted_exception>(ex)) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::aborted));
} else {
return make_exception_future<json::json_return_type>(std::move(ex));
}
} else if (f.get()->validation_errors) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::validation_errors));
} else {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::successful));
if (opt_stats && opt_stats->validation_errors) {
co_return json::json_return_type(static_cast<int>(scrub_status::validation_errors));
}
});
} catch (const sstables::compaction_aborted_exception&) {
co_return json::json_return_type(static_cast<int>(scrub_status::aborted));
} catch (...) {
apilog.error("scrub keyspace={} tables={} failed: {}", keyspace, column_families, std::current_exception());
throw;
}
co_return json::json_return_type(static_cast<int>(scrub_status::successful));
});
}

View File

@@ -8,6 +8,8 @@
#pragma once
#include <iostream>
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
@@ -41,8 +43,22 @@ sstring validate_keyspace(http_context& ctx, const parameters& param);
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
// Returns an empty vector if no parameter was found.
// If the parameter is found and empty, returns a list of all table names in the keyspace.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
struct table_info {
sstring name;
table_id id;
};
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
// Returns a vector of all table infos given by the parameter, or
// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.
std::vector<table_info> parse_table_infos(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ls);
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, routes& r);
@@ -58,4 +74,10 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}
} // namespace api
namespace std {
std::ostream& operator<<(std::ostream& os, const api::table_info& ti);
} // namespace std

View File

@@ -61,6 +61,16 @@ void set_system(http_context& ctx, routes& r) {
return json::json_void();
});
hs::write_log_message.set(r, [](const_req req) {
try {
logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
apilog.log(level, "/system/log: {}", std::string(req.get_query_param("message")));
} catch (boost::bad_lexical_cast& e) {
throw bad_param_exception("Unknown logging level " + req.get_query_param("level"));
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (replica::database& db) {

231
api/task_manager.cc Normal file
View File

@@ -0,0 +1,231 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "task_manager.hh"
#include "api/api-doc/task_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
#include "unimplemented.hh"
#include "storage_service.hh"
#include <utility>
#include <boost/range/adaptors.hpp>
namespace api {
namespace tm = httpd::task_manager_json;
using namespace json;
inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {
return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&
(!query_params.contains("table") || query_params["table"] == task->get_status().table);
}
struct full_task_status {
tasks::task_manager::task::status task_status;
std::string type;
tasks::task_manager::task::progress progress;
std::string module;
tasks::task_id parent_id;
tasks::is_abortable abortable;
std::vector<std::string> children_ids;
};
struct task_stats {
task_stats(tasks::task_manager::task_ptr task)
: task_id(task->id().to_sstring())
, state(task->get_status().state)
, type(task->type())
, keyspace(task->get_status().keyspace)
, table(task->get_status().table)
, entity(task->get_status().entity)
, sequence_number(task->get_status().sequence_number)
{ }
sstring task_id;
tasks::task_manager::task_state state;
std::string type;
std::string keyspace;
std::string table;
std::string entity;
uint64_t sequence_number;
};
tm::task_status make_status(full_task_status status) {
auto start_time = db_clock::to_time_t(status.task_status.start_time);
auto end_time = db_clock::to_time_t(status.task_status.end_time);
::tm st, et;
::gmtime_r(&end_time, &et);
::gmtime_r(&start_time, &st);
tm::task_status res{};
res.id = status.task_status.id.to_sstring();
res.type = status.type;
res.state = status.task_status.state;
res.is_abortable = bool(status.abortable);
res.start_time = st;
res.end_time = et;
res.error = status.task_status.error;
res.parent_id = status.parent_id.to_sstring();
res.sequence_number = status.task_status.sequence_number;
res.shard = status.task_status.shard;
res.keyspace = status.task_status.keyspace;
res.table = status.task_status.table;
res.entity = status.task_status.entity;
res.progress_units = status.task_status.progress_units;
res.progress_total = status.progress.total;
res.progress_completed = status.progress.completed;
res.children_ids = std::move(status.children_ids);
return res;
}
future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task_ptr& task) {
if (task.get() == nullptr) {
co_return coroutine::return_exception(httpd::bad_param_exception("Task not found"));
}
auto progress = co_await task->get_progress();
full_task_status s;
s.task_status = task->get_status();
s.type = task->type();
s.parent_id = task->get_parent_id();
s.abortable = task->is_abortable();
s.module = task->get_module_name();
s.progress.completed = progress.completed;
s.progress.total = progress.total;
std::vector<std::string> ct{task->get_children().size()};
boost::transform(task->get_children(), ct.begin(), [] (const auto& child) {
return child->id().to_sstring();
});
s.children_ids = std::move(ct);
co_return s;
}
void set_task_manager(http_context& ctx, routes& r) {
tm::get_modules.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
std::vector<std::string> v = boost::copy_range<std::vector<std::string>>(ctx.tm.local().get_modules() | boost::adaptors::map_keys);
co_return v;
});
tm::get_tasks.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
using chunked_stats = utils::chunked_vector<task_stats>;
auto internal = tasks::is_internal{req_param<bool>(*req, "internal", false)};
std::vector<chunked_stats> res = co_await ctx.tm.map([&req, internal] (tasks::task_manager& tm) {
chunked_stats local_res;
auto module = tm.find_module(req->param["module"]);
const auto& filtered_tasks = module->get_tasks() | boost::adaptors::filtered([&params = req->query_parameters, internal] (const auto& task) {
return (internal || !task.second->is_internal()) && filter_tasks(task.second, params);
});
for (auto& [task_id, task] : filtered_tasks) {
local_res.push_back(task_stats{task});
}
return local_res;
});
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
}
}
co_await s.write("]");
co_await s.close();
};
co_return std::move(f);
});
tm::get_task_status.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
auto state = task->get_status().state;
if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {
task->unregister_task();
}
co_return std::move(task);
}));
auto s = co_await retrieve_status(task);
co_return make_status(s);
});
tm::abort_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
});
co_return json_void();
});
tm::wait_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
return make_foreign(task);
});
}));
auto s = co_await retrieve_status(task);
co_return make_status(s);
});
tm::get_task_status_recursively.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto& _ctx = ctx;
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
std::queue<tasks::task_manager::foreign_task_ptr> q;
utils::chunked_vector<full_task_status> res;
// Get requested task.
auto task = co_await tasks::task_manager::invoke_on_task(_ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
auto state = task->get_status().state;
if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {
task->unregister_task();
}
co_return task;
}));
// Push children's statuses in BFS order.
q.push(co_await task.copy()); // Task cannot be moved since we need it to be alive during whole loop execution.
while (!q.empty()) {
auto& current = q.front();
res.push_back(co_await retrieve_status(current));
for (auto& child: current->get_children()) {
q.push(co_await child.copy());
}
q.pop();
}
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& status: res) {
co_await s.write(std::exchange(delim, ", "));
co_await formatter::write(s, make_status(status));
}
co_await s.write("]");
co_await s.close();
};
co_return f;
});
}
}

17
api/task_manager.hh Normal file
View File

@@ -0,0 +1,17 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
void set_task_manager(http_context& ctx, routes& r);
}

107
api/task_manager_test.cc Normal file
View File

@@ -0,0 +1,107 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>
#include "task_manager_test.hh"
#include "api/api-doc/task_manager_test.json.hh"
#include "tasks/test_module.hh"
namespace api {
namespace tmt = httpd::task_manager_test_json;
using namespace json;
void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg) {
tmt::register_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) {
auto m = make_shared<tasks::test_module>(tm);
tm.register_module("test", m);
});
co_return json_void();
});
tmt::unregister_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) -> future<> {
auto module_name = "test";
auto module = tm.find_module(module_name);
co_await module->stop();
});
co_return json_void();
});
tmt::register_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
sharded<tasks::task_manager>& tms = ctx.tm;
auto it = req->query_parameters.find("task_id");
auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();
it = req->query_parameters.find("shard");
unsigned shard = it != req->query_parameters.end() ? boost::lexical_cast<unsigned>(it->second) : 0;
it = req->query_parameters.find("keyspace");
std::string keyspace = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("table");
std::string table = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("entity");
std::string entity = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("parent_id");
tasks::task_info data;
if (it != req->query_parameters.end()) {
data.id = tasks::task_id{utils::UUID{it->second}};
auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(ctx.tm, data.id);
data.shard = parent_ptr->get_status().shard;
}
auto module = tms.local().find_module("test");
id = co_await module->make_task<tasks::test_task_impl>(shard, id, keyspace, table, entity, data);
co_await tms.invoke_on(shard, [id] (tasks::task_manager& tm) {
auto it = tm.get_all_tasks().find(id);
if (it != tm.get_all_tasks().end()) {
it->second->start();
}
});
co_return id.to_sstring();
});
tmt::unregister_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
});
co_return json_void();
});
tmt::finish_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto it = req->query_parameters.find("error");
bool fail = it != req->query_parameters.end();
std::string error = fail ? it->second : "";
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
}
return make_ready_future<>();
});
co_return json_void();
});
tmt::get_and_update_ttl.set(r, [&ctx, &cfg] (std::unique_ptr<request> req) -> future<json::json_return_type> {
uint32_t ttl = cfg.task_ttl_seconds();
co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);
co_return json::json_return_type(ttl);
});
}
}
#endif

22
api/task_manager_test.hh Normal file
View File

@@ -0,0 +1,22 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#pragma once
#include "api.hh"
#include "db/config.hh"
namespace api {
void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg);
}
#endif

View File

@@ -66,36 +66,48 @@ atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
// Based on Cassandra's resolveRegular function:
// - https://github.com/apache/cassandra/blob/e4f31b73c21b04966269c5ac2d3bd2562e5f6c63/src/java/org/apache/cassandra/db/rows/Cells.java#L79-L119
//
// Note: the ordering algorithm for cell is the same as for rows,
// except that the cell value is used to break a tie in case all other attributes are equal.
// See compare_row_marker_for_merge.
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
// Largest write timestamp wins.
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
// Tombstones always win reconciliation with live cells of the same timestamp
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
// Prefer expiring cells (which will become tombstones at some future date) over live cells.
// See https://issues.apache.org/jira/browse/CASSANDRA-14592
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
// If both are expiring, choose the cell with the latest expiry or derived write time.
if (left.is_live_and_has_ttl()) {
// Prefer cell with latest expiry
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
} else if (right.ttl() != left.ttl()) {
// The cell write time is derived by (expiry - ttl).
// Prefer the cell that was written later,
// so it survives longer after it expires, until purged,
// as it become purgeable gc_grace_seconds after it was written.
//
// Note that this is an extension to Cassandra's algorithm
// which stops at the expiration time, and if equal,
// move forward to compare the cell values.
return right.ttl() <=> left.ttl();
}
}
// The cell with the largest value wins, if all other attributes of the cells are identical.
// This is quite arbitrary, but still required to break the tie in a deterministic way.
return compare_unsigned(left.value(), right.value());
} else {
// Both are deleted

View File

@@ -229,6 +229,8 @@ future<authenticated_user> password_authenticator::authenticate(
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (exceptions::authentication_exception& e) {
std::throw_with_nested(e);
} catch (exceptions::unavailable_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.get_message()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));
}

View File

@@ -55,6 +55,7 @@ future<bool> default_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {

59
build_mode.hh Normal file
View File

@@ -0,0 +1,59 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#ifndef SCYLLA_BUILD_MODE
#error SCYLLA_BUILD_MODE must be defined
#endif
#ifndef STRINGIFY
// We need to levels of indirection
// to make a string out of the macro name.
// The outer level expands the macro
// and the inner level makes a string out of the expanded macro.
#define STRINGIFY_VALUE(x) #x
#define STRINGIFY_MACRO(x) STRINGIFY_VALUE(x)
#endif
#define SCYLLA_BUILD_MODE_STR STRINGIFY_MACRO(SCYLLA_BUILD_MODE)
// We use plain macro definitions
// so the preprocessor can expand them
// inline in the #if directives below
#define SCYLLA_BUILD_MODE_CODE_debug 0
#define SCYLLA_BUILD_MODE_CODE_release 1
#define SCYLLA_BUILD_MODE_CODE_dev 2
#define SCYLLA_BUILD_MODE_CODE_sanitize 3
#define SCYLLA_BUILD_MODE_CODE_coverage 4
#define _SCYLLA_BUILD_MODE_CODE(sbm) SCYLLA_BUILD_MODE_CODE_ ## sbm
#define SCYLLA_BUILD_MODE_CODE(sbm) _SCYLLA_BUILD_MODE_CODE(sbm)
#if SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_debug
#define SCYLLA_BUILD_MODE_DEBUG
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_release
#define SCYLLA_BUILD_MODE_RELEASE
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_dev
#define SCYLLA_BUILD_MODE_DEV
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_sanitize
#define SCYLLA_BUILD_MODE_SANITIZE
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_coverage
#define SCYLLA_BUILD_MODE_COVERAGE
#else
#error unrecognized SCYLLA_BUILD_MODE
#endif
#if (defined(SCYLLA_BUILD_MODE_RELEASE) || defined(SCYLLA_BUILD_MODE_DEV)) && defined(SEASTAR_DEBUG)
#error SEASTAR_DEBUG is not expected to be defined when SCYLLA_BUILD_MODE is "release" or "dev"
#endif
#if (defined(SCYLLA_BUILD_MODE_DEBUG) || defined(SCYLLA_BUILD_MODE_SANITIZE)) && !defined(SEASTAR_DEBUG)
#error SEASTAR_DEBUG is expected to be defined when SCYLLA_BUILD_MODE is "debug" or "sanitize"
#endif

View File

@@ -457,7 +457,9 @@ public:
_begin.ptr->size = _size;
_current = nullptr;
_size = 0;
return managed_bytes(std::exchange(_begin.ptr, {}));
auto begin_ptr = _begin.ptr;
_begin.ptr = nullptr;
return managed_bytes(begin_ptr);
} else {
return managed_bytes();
}

View File

@@ -572,7 +572,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
_read_context.cache().on_mispopulate();
return;
}
auto rt_opt = _rt_assembler.flush(*_schema, position_in_partition::after_key(cr.key()));
auto rt_opt = _rt_assembler.flush(*_schema, position_in_partition::after_key(*_schema, cr.key()));
clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
_lsa_manager.run_in_update_section_with_allocator([this, &cr, &rt_opt] {
mutation_partition& mp = _snp->version()->partition();
@@ -634,8 +634,8 @@ inline
void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", fmt::ptr(this), _next_row.position(), _next_row_in_range);
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
auto upper_bound = _next_row_in_range ? next_lower_bound : _upper_bound;
auto next_lower_bound = position_in_partition_view::after_key(table_schema(), _next_row.position());
auto upper_bound = _next_row_in_range ? next_lower_bound.view : _upper_bound;
if (_snp->range_tombstones(_lower_bound, upper_bound, [&] (range_tombstone rts) {
add_range_tombstone_to_buffer(std::move(rts));
return stop_iteration(_lower_bound_changed && is_buffer_full());
@@ -737,7 +737,9 @@ void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
&& _snp->at_oldest_version()) {
with_allocator(_snp->region().allocator(), [&] {
_last_row->on_evicted(_read_context.cache()._tracker);
cache_tracker& tracker = _read_context.cache()._tracker;
tracker.get_lru().remove(*_last_row);
_last_row->on_evicted(tracker);
});
_last_row = nullptr;
@@ -772,14 +774,14 @@ void cache_flat_mutation_reader::move_to_next_entry() {
}
}
void cache_flat_mutation_reader::flush_tombstones(position_in_partition_view pos, bool end_of_range) {
void cache_flat_mutation_reader::flush_tombstones(position_in_partition_view pos_, bool end_of_range) {
// Ensure position is appropriate for range tombstone bound
pos = position_in_partition_view::after_key(pos);
clogger.trace("csm {}: flush_tombstones({}) end_of_range: {}", fmt::ptr(this), pos, end_of_range);
_rt_gen.flush(pos, [this] (range_tombstone_change&& rtc) {
auto pos = position_in_partition_view::after_key(*_schema, pos_);
clogger.trace("csm {}: flush_tombstones({}) end_of_range: {}", fmt::ptr(this), pos.view, end_of_range);
_rt_gen.flush(pos.view, [this] (range_tombstone_change&& rtc) {
add_to_buffer(std::move(rtc), source::cache);
}, end_of_range);
if (auto rtc_opt = _rt_merger.flush(pos, end_of_range)) {
if (auto rtc_opt = _rt_merger.flush(pos.view, end_of_range)) {
do_add_to_buffer(std::move(*rtc_opt));
}
}
@@ -830,7 +832,7 @@ inline
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment_v2&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", fmt::ptr(this), mutation_fragment_v2::printer(*_schema, mf));
auto& row = mf.as_clustering_row();
auto new_lower_bound = position_in_partition::after_key(row.key());
auto new_lower_bound = position_in_partition::after_key(*_schema, row.key());
push_mutation_fragment(std::move(mf));
_lower_bound = std::move(new_lower_bound);
_lower_bound_changed = true;

View File

@@ -15,14 +15,7 @@
#include "converting_mutation_partition_applier.hh"
#include "hashing_partition_visitor.hh"
#include "utils/UUID.hh"
#include "serializer.hh"
#include "idl/uuid.dist.hh"
#include "idl/keys.dist.hh"
#include "idl/mutation.dist.hh"
#include "serializer_impl.hh"
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/keys.dist.impl.hh"
#include "idl/mutation.dist.impl.hh"
#include <iostream>
@@ -44,7 +37,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
}).end_canonical_mutation();
}
utils::UUID canonical_mutation::column_family_id() const {
table_id canonical_mutation::column_family_id() const {
auto in = ser::as_input_stream(_data);
auto mv = ser::deserialize(in, boost::type<ser::canonical_mutation_view>());
return mv.table_id();
@@ -120,17 +113,19 @@ std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm) {
auto&& entry = _cm.static_column_at(id);
fmt::print(_os, "static column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
}
virtual void accept_row_tombstone(range_tombstone rt) override {
virtual stop_iteration accept_row_tombstone(range_tombstone rt) override {
print_separator();
fmt::print(_os, "row tombstone {}", rt);
return stop_iteration::no;
}
virtual void accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) override {
virtual stop_iteration accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) override {
if (_in_row) {
fmt::print(_os, "}}, ");
}
fmt::print(_os, "{{row {} tombstone {} marker {}", pipv, rt, rm);
_in_row = true;
_first = false;
return stop_iteration::no;
}
virtual void accept_row_cell(column_id id, atomic_cell ac) override {
print_separator();

View File

@@ -14,10 +14,6 @@
#include "bytes_ostream.hh"
#include <iosfwd>
namespace utils {
class UUID;
} // namespace utils
// Immutable mutation form which can be read using any schema version of the same table.
// Safe to access from other shards via const&.
// Safe to pass serialized across nodes.
@@ -39,7 +35,7 @@ public:
// is not intended, user should sync the schema first.
mutation to_mutation(schema_ptr) const;
utils::UUID column_family_id() const;
table_id column_family_id() const;
const bytes_ostream& representation() const { return _data; }

View File

@@ -25,6 +25,7 @@
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/UUID_gen.hh"
#include "utils/error_injection.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
@@ -44,8 +45,16 @@ static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
api::timestamp_clock::duration get_generation_leeway() {
static thread_local auto generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
utils::get_local_injector().inject("increase_cdc_generation_leeway", [&] {
generation_leeway = std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::minutes(5));
});
return generation_leeway;
}
static void copy_int_to_bytes(int64_t i, size_t offset, bytes& b) {
i = net::hton(i);
@@ -160,18 +169,18 @@ bool token_range_description::operator==(const token_range_description& o) const
&& sharding_ignore_msb == o.sharding_ignore_msb;
}
topology_description::topology_description(std::vector<token_range_description> entries)
topology_description::topology_description(utils::chunked_vector<token_range_description> entries)
: _entries(std::move(entries)) {}
bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const& {
const utils::chunked_vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
utils::chunked_vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
@@ -263,7 +272,7 @@ public:
topology_description generate() const {
const auto tokens = get_tokens();
std::vector<token_range_description> vnode_descriptions;
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
vnode_descriptions.push_back(
@@ -331,7 +340,7 @@ future<cdc::generation_id> generation_service::make_new_generation(const std::un
auto new_generation_timestamp = [add_delay, ring_delay = _cfg.ring_delay] {
auto ts = db_clock::now();
if (add_delay && ring_delay != 0ms) {
ts += 2 * ring_delay + duration_cast<milliseconds>(generation_leeway);
ts += 2 * ring_delay + duration_cast<milliseconds>(get_generation_leeway());
}
return ts;
};
@@ -455,7 +464,12 @@ static future<> update_streams_description(
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
try {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
} catch (seastar::sleep_aborted&) {
cdc_log.warn( "Aborted update CDC description table with generation {}", gen_id);
co_return;
}
try {
co_await do_update_streams_description(gen_id, *sys_dist_ks, { get_num_token_owners() });
co_return;
@@ -578,7 +592,7 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
co_return;
}
if (co_await db::system_keyspace::cdc_is_rewritten()) {
if (co_await _sys_ks.local().cdc_is_rewritten()) {
co_return;
}
@@ -602,13 +616,13 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
continue;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id().uuid()), cdc_opts.ttl()});
}
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
co_return co_await _sys_ks.local().cdc_set_rewritten(std::nullopt);
}
auto get_num_token_owners = [tm = _token_metadata.get()] { return tm->count_normal_token_owners(); };
@@ -626,7 +640,7 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
std::move(get_num_token_owners),
_abort_src);
co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
co_await _sys_ks.local().cdc_set_rewritten(last_rewritten);
}
static void assert_shard_zero(const sstring& where) {
@@ -870,7 +884,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
{ gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(new_gen_id) },
{ gms::application_state::STATUS, *status }
});
co_await db::system_keyspace::update_cdc_generation_id(new_gen_id);
co_await _sys_ks.local().update_cdc_generation_id(new_gen_id);
}
future<> generation_service::handle_cdc_generation(std::optional<cdc::generation_id> gen_id) {
@@ -1013,7 +1027,7 @@ future<bool> generation_service::do_handle_cdc_generation(cdc::generation_id gen
// The assumption follows from the requirement of bootstrapping nodes sequentially.
if (!_gen_id || get_ts(*_gen_id) < get_ts(gen_id)) {
_gen_id = gen_id;
co_await db::system_keyspace::update_cdc_generation_id(gen_id);
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(gen_id));
}

View File

@@ -46,6 +46,8 @@ namespace gms {
namespace cdc {
api::timestamp_clock::duration get_generation_leeway();
class stream_id final {
bytes _value;
public:
@@ -94,13 +96,13 @@ struct token_range_description {
* in the `_entries` vector. See the comment above `token_range_description` for explanation.
*/
class topology_description {
std::vector<token_range_description> _entries;
utils::chunked_vector<token_range_description> _entries;
public:
topology_description(std::vector<token_range_description> entries);
topology_description(utils::chunked_vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
const utils::chunked_vector<token_range_description>& entries() const&;
utils::chunked_vector<token_range_description>&& entries() &&;
};
/**

View File

@@ -59,7 +59,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
static schema_ptr create_log_schema(const schema&, std::optional<table_id> = {}, schema_ptr = nullptr);
}
static constexpr auto cdc_group_name = "cdc";
@@ -485,7 +485,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uuid, schema_ptr old) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner(cdc::cdc_partitioner::classname);
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -605,7 +605,7 @@ private:
public:
collection_iterator(managed_bytes_view_opt v = {})
: _v(v.value_or(managed_bytes_view{}))
, _rem(_v.empty() ? 0 : read_collection_size(_v, cql_serialization_format::internal()))
, _rem(_v.empty() ? 0 : read_collection_size(_v))
{
if (_rem != 0) {
parse();
@@ -650,8 +650,8 @@ template<>
void collection_iterator<std::pair<managed_bytes_view, managed_bytes_view>>::parse() {
assert(_rem > 0);
_next = _v;
auto k = read_collection_value(_next, cql_serialization_format::internal());
auto v = read_collection_value(_next, cql_serialization_format::internal());
auto k = read_collection_value(_next);
auto v = read_collection_value(_next);
_current = std::make_pair(k, v);
}
@@ -659,7 +659,7 @@ template<>
void collection_iterator<managed_bytes_view>::parse() {
assert(_rem > 0);
_next = _v;
auto k = read_collection_value(_next, cql_serialization_format::internal());
auto k = read_collection_value(_next);
_current = k;
}
@@ -728,7 +728,7 @@ auto make_maybe_back_inserter(Container& c, const abstract_type& type, collectio
static size_t collection_size(const managed_bytes_opt& bo) {
if (bo) {
managed_bytes_view mbv(*bo);
return read_collection_size(mbv, cql_serialization_format::internal());
return read_collection_size(mbv);
}
return 0;
}
@@ -750,7 +750,7 @@ static managed_bytes merge(const collection_type_impl& ctype, const managed_byte
// note order: set_union, when finding doubles, use value from first1 (j here). So
// since this is next, it has prio
std::set_union(j, e, i, e, make_maybe_back_inserter(res, *type, collection_iterator<managed_bytes_view>(deleted)), cmp);
return map_type_impl::serialize_partially_deserialized_form_fragmented(res, cql_serialization_format::internal());
return map_type_impl::serialize_partially_deserialized_form_fragmented(res);
}
static managed_bytes merge(const set_type_impl& ctype, const managed_bytes_opt& prev, const managed_bytes_opt& next, const managed_bytes_opt& deleted) {
std::vector<managed_bytes_view> res;
@@ -761,7 +761,7 @@ static managed_bytes merge(const set_type_impl& ctype, const managed_bytes_opt&
};
collection_iterator<managed_bytes_view> e, i(prev), j(next), d(deleted);
std::set_union(j, e, i, e, make_maybe_back_inserter(res, *type, d), cmp);
return set_type_impl::serialize_partially_deserialized_form_fragmented(res, cql_serialization_format::internal());
return set_type_impl::serialize_partially_deserialized_form_fragmented(res);
}
static managed_bytes merge(const user_type_impl& type, const managed_bytes_opt& prev, const managed_bytes_opt& next, const managed_bytes_opt& deleted) {
std::vector<managed_bytes_view_opt> res(type.size());
@@ -812,15 +812,14 @@ static managed_bytes_opt get_preimage_col_value(const column_definition& cdef, c
// flatten set
[&] (const set_type_impl& type) {
auto v = pirow->get_view(cdef.name_as_text());
auto f = cql_serialization_format::internal();
auto n = read_collection_size(v, f);
auto n = read_collection_size(v);
std::vector<managed_bytes> tmp;
tmp.reserve(n);
while (n--) {
tmp.emplace_back(read_collection_value(v, f)); // key
read_collection_value(v, f); // value. ignore.
tmp.emplace_back(read_collection_value(v)); // key
read_collection_value(v); // value. ignore.
}
return set_type_impl::serialize_partially_deserialized_form_fragmented({tmp.begin(), tmp.end()}, f);
return set_type_impl::serialize_partially_deserialized_form_fragmented({tmp.begin(), tmp.end()});
},
[&] (const abstract_type& o) -> managed_bytes {
return pirow->get_blob_fragmented(cdef.name_as_text());
@@ -1122,7 +1121,7 @@ struct process_row_visitor {
visit_collection(v);
managed_bytes_opt added_keys = v._added_keys.empty() ? std::nullopt :
std::optional{set_type_impl::serialize_partially_deserialized_form_fragmented(v._added_keys, cql_serialization_format::internal())};
std::optional{set_type_impl::serialize_partially_deserialized_form_fragmented(v._added_keys)};
return {
v._is_column_delete,
@@ -1178,7 +1177,7 @@ struct process_row_visitor {
visit_collection(v);
managed_bytes_opt added_cells = v._added_cells.empty() ? std::nullopt :
std::optional{map_type_impl::serialize_partially_deserialized_form_fragmented(v._added_cells, cql_serialization_format::internal())};
std::optional{map_type_impl::serialize_partially_deserialized_form_fragmented(v._added_cells)};
return {
v._is_column_delete,
@@ -1198,7 +1197,7 @@ struct process_row_visitor {
// then we deserialize again when merging images below
managed_bytes_opt deleted_elements = std::nullopt;
if (!deleted_keys.empty()) {
deleted_elements = set_type_impl::serialize_partially_deserialized_form_fragmented(deleted_keys, cql_serialization_format::internal());
deleted_elements = set_type_impl::serialize_partially_deserialized_form_fragmented(deleted_keys);
}
// delta
@@ -1655,7 +1654,8 @@ public:
auto partition_slice = query::partition_slice(std::move(bounds), std::move(static_columns), std::move(regular_columns), std::move(opts));
const auto max_result_size = _ctx._proxy.get_max_result_size(partition_slice);
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), partition_slice, query::max_result_size(max_result_size), query::row_limit(row_limit));
const auto tombstone_limit = query::tombstone_limit(_ctx._proxy.get_tombstone_limit());
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), partition_slice, query::max_result_size(max_result_size), tombstone_limit, query::row_limit(row_limit));
const auto select_cl = adjust_cl(write_cl);

View File

@@ -15,10 +15,6 @@
extern logging::logger cdc_log;
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway;
} // namespace cdc
static api::timestamp_type to_ts(db_clock::time_point tp) {
// This assumes that timestamp_clock and db_clock have the same epochs.
return std::chrono::duration_cast<api::timestamp_clock::duration>(tp.time_since_epoch()).count();
@@ -40,7 +36,7 @@ static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
const utils::chunked_vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {
on_internal_error(cdc_log, "get_stream: entries empty");
@@ -73,7 +69,7 @@ bool cdc::metadata::streams_available() const {
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + generation_leeway.count()) {
if (ts > now + get_generation_leeway().count()) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the future\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps arbitrarily into the future, because we don't"
@@ -86,27 +82,43 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
// Nothing protects us from that until we start using transactions for generation switching.
}
auto it = gen_used_at(now);
if (it == _gens.end()) {
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
}
if (ts <= now - get_generation_leeway().count()) {
// We reject the write if `ts <= now - generation_leeway` and the write is not to the current generation, which
// happens iff one of the following is true:
// - the write is to no generation,
// - the write is to a generation older than the generation under `it`,
// - the write is to the generation under `it` and that generation is not the current generation.
// Note that we cannot distinguish the first and second cases because we garbage-collect obsolete generations,
// but we can check if one of them takes place (`it == _gens.end() || ts < it->first`). These three conditions
// are sufficient. The write with `ts <= now - generation_leeway` cannot be to one of the generations following
// the generation under `it` because that generation was operating at `now - generation_leeway`.
bool is_previous_gen = it != _gens.end() && std::next(it) != _gens.end() && std::next(it)->first <= now;
if (it == _gens.end() || ts < it->first || is_previous_gen) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the past\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties.\n"
"We *do* allow sending writes into the near past, but our ability to do that is limited."
" Are you using client-side timestamps? Make sure your clocks are well-synchronized"
" with the database's clocks.", format_timestamp(ts), format_timestamp(now)));
}
}
it = _gens.begin();
if (it == _gens.end() || ts < it->first) {
throw std::runtime_error(format(
"cdc::metadata::get_stream: could not find any CDC stream (current time: {})."
" Are we in the middle of a cluster upgrade?", format_timestamp(now)));
"cdc::metadata::get_stream: could not find any CDC stream for timestamp {}."
" Are we in the middle of a cluster upgrade?", format_timestamp(ts)));
}
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
if (it->first > ts) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream from an earlier generation than the currently used one."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties (write timestamp: {}, current generation started at: {})",
format_timestamp(ts), format_timestamp(it->first)));
}
// With `generation_leeway` we allow sending writes to the near future. It might happen
// that `ts` doesn't belong to the current generation ("current" according to our clock),
// but to the next generation. Adjust for this case:
// Find the generation operating at `ts`.
{
auto next_it = std::next(it);
while (next_it != _gens.end() && next_it->first <= ts) {
@@ -147,8 +159,8 @@ bool cdc::metadata::known_or_obsolete(db_clock::time_point tp) const {
++it;
}
// Check if some new generation has already superseded this one.
return it != _gens.end() && it->first <= api::new_timestamp();
// Check if the generation is obsolete.
return it != _gens.end() && it->first <= api::new_timestamp() - get_generation_leeway().count();
}
bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen) {
@@ -157,7 +169,7 @@ bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen)
}
auto now = api::new_timestamp();
auto it = gen_used_at(now);
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.

View File

@@ -42,7 +42,9 @@ class metadata final {
container_t::const_iterator gen_used_at(api::timestamp_type ts) const;
public:
/* Is a generation with the given timestamp already known or superseded by a newer generation? */
/* Is a generation with the given timestamp already known or obsolete? It is obsolete if and only if
* it is older than the generation operating at `now - get_generation_leeway()`.
*/
bool known_or_obsolete(db_clock::time_point) const;
/* Are there streams available. I.e. valid for time == now. If this is false, any writes to
@@ -54,8 +56,9 @@ public:
*
* If the provided timestamp is too far away "into the future" (where "now" is defined according to our local clock),
* we reject the get_stream query. This is because the resulting stream might belong to a generation which we don't
* yet know about. The amount of leeway (how much "into the future" we allow `ts` to be) is defined
* by the `cdc::generation_leeway` constant.
* yet know about. Similarly, we reject queries to the previous generations if the timestamp is too far away "into
* the past". The amount of leeway (how much "into the future" or "into the past" we allow `ts` to be) is defined by
* `get_generation_leeway()`.
*/
stream_id get_stream(api::timestamp_type ts, dht::token tok);

View File

@@ -21,8 +21,6 @@ class row_tombstone;
class collection_mutation;
class cql_serialization_format;
// An auxiliary struct used to (de)construct collection_mutations.
// Unlike collection_mutation which is a serialized blob, this struct allows to inspect logical units of information
// (tombstone and cells) inside the mutation easily.
@@ -131,4 +129,4 @@ collection_mutation merge(const abstract_type&, collection_mutation_view, collec
collection_mutation difference(const abstract_type&, collection_mutation_view, collection_mutation_view);
// Serializes the given collection of cells to a sequence of bytes ready to be sent over the CQL protocol.
bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view, cql_serialization_format);
bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view);

View File

@@ -12,7 +12,13 @@
class schema;
class partition_key;
class clustering_row;
struct atomic_cell_view;
struct tombstone;
namespace db::view {
struct clustering_or_static_row;
struct view_key_and_action;
}
class column_computation;
using column_computation_ptr = std::unique_ptr<column_computation>;
@@ -22,7 +28,7 @@ using column_computation_ptr = std::unique_ptr<column_computation>;
* Computed columns description is also available at docs/dev/system_schema_keyspace.md. They hold values
* not provided directly by the user, but rather computed: from other column values and possibly other sources.
* This class is able to serialize/deserialize column computations and perform the computation itself,
* based on given schema, partition key and clustering row. Responsibility for providing enough data
* based on given schema, and partition key. Responsibility for providing enough data
* in the clustering row in order for computation to succeed belongs to the caller. In particular,
* generating a value might involve performing a read-before-write if the computation is performed
* on more values than are present in the update request.
@@ -36,7 +42,19 @@ public:
virtual column_computation_ptr clone() const = 0;
virtual bytes serialize() const = 0;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const = 0;
virtual bytes compute_value(const schema& schema, const partition_key& key) const = 0;
/*
* depends_on_non_primary_key_column for a column computation is needed to
* detect a case where the primary key of a materialized view depends on a
* non primary key column from the base table, but at the same time, the view
* itself doesn't have non-primary key columns. This is an issue, since as
* for now, it was assumed that no non-primary key columns in view schema
* meant that the update cannot change the primary key of the view, and
* therefore the update path can be simplified.
*/
virtual bool depends_on_non_primary_key_column() const {
return false;
}
};
/*
@@ -54,7 +72,7 @@ public:
return std::make_unique<legacy_token_column_computation>(*this);
}
virtual bytes serialize() const override;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
};
@@ -75,5 +93,54 @@ public:
return std::make_unique<token_column_computation>(*this);
}
virtual bytes serialize() const override;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
};
/*
* collection_column_computation is used for a secondary index on a collection
* column. In this case we don't have a single value to compute, but rather we
* want to return multiple values (e.g., all the keys in the collection).
* So this class does not implement the base class's compute_value() -
* instead it implements a new method compute_collection_values(), which
* can return multiple values. This new method is currently called only from
* the materialized-view code which uses collection_column_computation.
*/
class collection_column_computation final : public column_computation {
enum class kind {
keys,
values,
entries,
};
const bytes _collection_name;
const kind _kind;
collection_column_computation(const bytes& collection_name, kind kind) : _collection_name(collection_name), _kind(kind) {}
using collection_kv = std::pair<bytes_view, atomic_cell_view>;
void operate_on_collection_entries(
std::invocable<collection_kv*, collection_kv*, tombstone> auto&& old_and_new_row_func, const schema& schema,
const partition_key& key, const db::view::clustering_or_static_row& update, const std::optional<db::view::clustering_or_static_row>& existing) const;
public:
static collection_column_computation for_keys(const bytes& collection_name) {
return {collection_name, kind::keys};
}
static collection_column_computation for_values(const bytes& collection_name) {
return {collection_name, kind::values};
}
static collection_column_computation for_entries(const bytes& collection_name) {
return {collection_name, kind::entries};
}
static column_computation_ptr for_target_type(std::string_view type, const bytes& collection_name);
virtual bytes serialize() const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
virtual column_computation_ptr clone() const override {
return std::make_unique<collection_column_computation>(*this);
}
virtual bool depends_on_non_primary_key_column() const override {
return true;
}
std::vector<db::view::view_key_and_action> compute_values_with_action(const schema& schema, const partition_key& key,
const db::view::clustering_or_static_row& row, const std::optional<db::view::clustering_or_static_row>& existing) const;
};

View File

@@ -28,6 +28,7 @@
#include <seastar/util/closeable.hh>
#include <seastar/core/shared_ptr.hh>
#include "dht/i_partitioner.hh"
#include "sstables/sstables.hh"
#include "sstables/sstable_writer.hh"
#include "sstables/progress_monitor.hh"
@@ -41,6 +42,7 @@
#include "mutation_compactor.hh"
#include "leveled_manifest.hh"
#include "dht/token.hh"
#include "dht/partition_filter.hh"
#include "mutation_writer/shard_based_splitting_writer.hh"
#include "mutation_writer/partition_based_splitting_writer.hh"
#include "mutation_source_metadata.hh"
@@ -52,6 +54,7 @@
#include "readers/filtering.hh"
#include "readers/compacting.hh"
#include "tombstone_gc.hh"
#include "keys.hh"
namespace sstables {
@@ -165,7 +168,7 @@ std::ostream& operator<<(std::ostream& os, pretty_printed_throughput tp) {
}
static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk) {
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks) {
auto timestamp = table_s.min_memtable_timestamp();
std::optional<utils::hashed_key> hk;
for (auto&& sst : boost::range::join(selector.select(dk).sstables, table_s.compacted_undeleted_sstables())) {
@@ -176,6 +179,7 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
hk = sstables::sstable::make_hashed_key(*table_s.schema(), dk.key());
}
if (sst->filter_has_key(*hk)) {
bloom_filter_checks++;
timestamp = std::min(timestamp, sst->get_stats_metadata().min_timestamp);
}
}
@@ -219,13 +223,13 @@ public:
~compaction_write_monitor() {
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
}
virtual void on_write_started(const sstables::writer_offset_tracker& tracker) override {
_tracker = &tracker;
_table_s.get_compaction_strategy().get_backlog_tracker().register_partially_written_sstable(_sst, *this);
_table_s.get_backlog_tracker().register_partially_written_sstable(_sst, *this);
}
virtual void on_data_write_completed() override {
@@ -277,6 +281,18 @@ class compacted_fragments_writer {
stop_func_t _stop_compaction_writer;
std::optional<utils::observer<>> _stop_request_observer;
bool _unclosed_partition = false;
struct partition_state {
dht::decorated_key_opt dk;
// Partition tombstone is saved for the purpose of replicating it to every fragment storing a partition pL.
// Then when reading from the SSTable run, we won't unnecessarily have to open >= 2 fragments, the one which
// contains the tombstone and another one(s) that has the partition slice being queried.
::tombstone tombstone;
// Used to determine whether any active tombstones need closing at EOS.
::tombstone current_emitted_tombstone;
// Track last emitted clustering row, which will be used to close active tombstone if splitting partition
position_in_partition last_pos = position_in_partition::before_all_clustered_rows();
bool is_splitting_partition = false;
} _current_partition;
private:
inline void maybe_abort_compaction();
@@ -286,6 +302,13 @@ private:
consume_end_of_stream();
});
}
void stop_current_writer();
bool can_split_large_partition() const;
void track_last_position(position_in_partition_view pos);
void split_large_partition();
void do_consume_new_partition(const dht::decorated_key& dk);
stop_iteration do_consume_end_of_partition();
public:
explicit compacted_fragments_writer(compaction& c, creator_func_t cpw, stop_func_t scw)
: _c(c)
@@ -304,7 +327,7 @@ public:
void consume_new_partition(const dht::decorated_key& dk);
void consume(tombstone t) { _compaction_writer->writer.consume(t); }
void consume(tombstone t);
stop_iteration consume(static_row&& sr, tombstone, bool) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(sr));
@@ -312,17 +335,11 @@ public:
stop_iteration consume(static_row&& sr) {
return consume(std::move(sr), tombstone{}, bool{});
}
stop_iteration consume(clustering_row&& cr, row_tombstone, bool) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(cr));
}
stop_iteration consume(clustering_row&& cr, row_tombstone, bool);
stop_iteration consume(clustering_row&& cr) {
return consume(std::move(cr), row_tombstone{}, bool{});
}
stop_iteration consume(range_tombstone_change&& rtc) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(rtc));
}
stop_iteration consume(range_tombstone_change&& rtc);
stop_iteration consume_end_of_partition();
void consume_end_of_stream();
@@ -337,7 +354,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
public:
virtual void on_read_started(const sstables::reader_position_tracker& tracker) override {
_tracker = &tracker;
_table_s.get_compaction_strategy().get_backlog_tracker().register_compacting_sstable(_sst, *this);
_table_s.get_backlog_tracker().register_compacting_sstable(_sst, *this);
}
virtual void on_read_completed() override {
@@ -356,7 +373,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
void remove_sstable() {
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
_sst = {};
}
@@ -368,7 +385,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
// We failed to finish handling this SSTable, so we have to update the backlog_tracker
// about it.
if (_sst) {
_table_s.get_compaction_strategy().get_backlog_tracker().revert_charges(_sst);
_table_s.get_backlog_tracker().revert_charges(_sst);
}
}
@@ -398,9 +415,12 @@ private:
class formatted_sstables_list {
bool _include_origin = true;
std::vector<sstring> _ssts;
std::vector<std::string> _ssts;
public:
formatted_sstables_list() = default;
void reserve(size_t n) {
_ssts.reserve(n);
}
explicit formatted_sstables_list(const std::vector<shared_sstable>& ssts, bool include_origin) : _include_origin(include_origin) {
_ssts.reserve(ssts.size());
for (const auto& sst : ssts) {
@@ -419,9 +439,7 @@ public:
};
std::ostream& operator<<(std::ostream& os, const formatted_sstables_list& lst) {
os << "[";
os << boost::algorithm::join(lst._ssts, ",");
os << "]";
fmt::print(os, "[{}]", fmt::join(lst._ssts, ","));
return os;
}
@@ -446,12 +464,15 @@ protected:
uint64_t _start_size = 0;
uint64_t _end_size = 0;
uint64_t _estimated_partitions = 0;
double _estimated_droppable_tombstone_ratio = 0;
uint64_t _bloom_filter_checks = 0;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
bool _can_split_large_partition = false;
bool _contains_multi_fragment_runs = false;
mutation_source_metadata _ms_metadata = {};
compaction_sstable_replacer_fn _replacer;
utils::UUID _run_identifier;
run_id _run_identifier;
::io_priority_class _io_priority;
// optional clone of sstable set to be used for expiration purposes, so it will be set if expiration is enabled.
std::optional<sstable_set> _sstable_set;
@@ -479,6 +500,7 @@ protected:
, _type(descriptor.options.type())
, _max_sstable_size(descriptor.max_sstable_bytes)
, _sstable_level(descriptor.level)
, _can_split_large_partition(descriptor.can_split_large_partition)
, _replacer(std::move(descriptor.replacer))
, _run_identifier(descriptor.run_identifier)
, _io_priority(descriptor.io_priority)
@@ -489,7 +511,7 @@ protected:
for (auto& sst : _sstables) {
_stats_collector.update(sst->get_encoding_stats_for_compaction());
}
std::unordered_set<utils::UUID> ssts_run_ids;
std::unordered_set<run_id> ssts_run_ids;
_contains_multi_fragment_runs = std::any_of(_sstables.begin(), _sstables.end(), [&ssts_run_ids] (shared_sstable& sst) {
return !ssts_run_ids.insert(sst->run_identifier()).second;
});
@@ -500,7 +522,7 @@ protected:
auto max_sstable_size = std::max<uint64_t>(_max_sstable_size, 1);
uint64_t estimated_sstables = std::max(1UL, uint64_t(ceil(double(_start_size) / max_sstable_size)));
return std::min(uint64_t(ceil(double(_estimated_partitions) / estimated_sstables)),
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimated_partitions));
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimated_partitions, _schema));
}
void setup_new_sstable(shared_sstable& sst) {
@@ -555,14 +577,15 @@ protected:
return bool(_sstable_set);
}
compaction_writer create_gc_compaction_writer() const {
compaction_writer create_gc_compaction_writer(run_id gc_run) const {
auto sst = _sstable_creator(this_shard_id());
auto&& priority = _io_priority;
auto monitor = std::make_unique<compaction_write_monitor>(sst, _table_s, maximum_timestamp(), _sstable_level);
sstable_writer_config cfg = _table_s.configure_writer("garbage_collection");
cfg.run_identifier = _run_identifier;
cfg.run_identifier = gc_run;
cfg.monitor = monitor.get();
uint64_t estimated_partitions = std::max(1UL, uint64_t(ceil(partitions_per_sstable() * _estimated_droppable_tombstone_ratio)));
auto writer = sst->get_writer(*schema(), partitions_per_sstable(), cfg, get_encoding_stats(), priority);
return compaction_writer(std::move(monitor), std::move(writer), std::move(sst));
}
@@ -582,8 +605,14 @@ protected:
// When compaction finishes, all the temporary sstables generated here will be deleted and removed
// from table's sstable set.
compacted_fragments_writer get_gc_compacted_fragments_writer() {
// because the temporary sstable run can overlap with the non-gc sstables run created by
// get_compacted_fragments_writer(), we have to use a different run_id. the gc_run_id is
// created here as:
// 1. it can be shared across all sstables created by this writer
// 2. it is optional, as gc writer is not always used
auto gc_run = run_id::create_random_id();
return compacted_fragments_writer(*this,
[this] (const dht::decorated_key&) { return create_gc_compaction_writer(); },
[this, gc_run] (const dht::decorated_key&) { return create_gc_compaction_writer(gc_run); },
[this] (compaction_writer* cw) { stop_gc_compaction_writer(cw); },
_stop_request_observable);
}
@@ -600,8 +629,8 @@ protected:
return _used_garbage_collected_sstables;
}
bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max();
virtual bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max() && bool(_replacer);
}
public:
compaction& operator=(const compaction&) = delete;
@@ -623,9 +652,11 @@ private:
future<> setup() {
auto ssts = make_lw_shared<sstables::sstable_set>(make_sstable_set_for_input());
formatted_sstables_list formatted_msg;
formatted_msg.reserve(_sstables.size());
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
min_max_tracker<api::timestamp_type> timestamp_tracker;
double sum_of_estimated_droppable_tombstone_ratio = 0;
_input_sstable_generations.reserve(_sstables.size());
for (auto& sst : _sstables) {
co_await coroutine::maybe_yield();
@@ -660,12 +691,16 @@ private:
// this is kind of ok, esp. since we will hopefully not be trying to recover based on
// compacted sstables anyway (CL should be clean by then).
_rp = std::max(_rp, sst_stats.position);
auto gc_before = sst->get_gc_before_for_drop_estimation(gc_clock::now(), _table_s.get_tombstone_gc_state());
sum_of_estimated_droppable_tombstone_ratio += sst->estimate_droppable_tombstone_ratio(gc_before);
}
log_info("{} {}", report_start_desc(), formatted_msg);
if (ssts->all()->size() < _sstables.size()) {
log_debug("{} out of {} input sstables are fully expired sstables that will not be actually compacted",
_sstables.size() - ssts->all()->size(), _sstables.size());
}
// _estimated_droppable_tombstone_ratio could exceed 1.0 in certain cases, so limit it to 1.0.
_estimated_droppable_tombstone_ratio = std::min(1.0, sum_of_estimated_droppable_tombstone_ratio / ssts->all()->size());
_compacting = std::move(ssts);
@@ -684,7 +719,8 @@ private:
reader.consume_in_thread(std::move(cfc));
});
});
return consumer(make_compacting_reader(make_sstable_reader(), compaction_time, max_purgeable_func()));
const auto& gc_state = _table_s.get_tombstone_gc_state();
return consumer(make_compacting_reader(make_sstable_reader(), compaction_time, max_purgeable_func(), gc_state));
}
future<> consume() {
@@ -704,6 +740,7 @@ private:
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, compacted_fragments_writer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_compacted_fragments_writer(),
get_gc_compacted_fragments_writer());
@@ -713,6 +750,7 @@ private:
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, noop_compacted_fragments_consumer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_compacted_fragments_writer(),
noop_compacted_fragments_consumer());
reader.consume_in_thread(std::move(cfc));
@@ -736,6 +774,7 @@ protected:
.ended_at = ended_at,
.start_size = _start_size,
.end_size = _end_size,
.bloom_filter_checks = _bloom_filter_checks,
},
};
@@ -755,7 +794,7 @@ protected:
log_info("{} {} sstables to {}. {} to {} (~{}% of original) in {}ms = {}. ~{} total partitions merged to {}.",
report_finish_desc(),
_input_sstable_generations.size(), new_sstables_msg, pretty_printed_data_size(_start_size), pretty_printed_data_size(_end_size), int(ratio * 100),
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count(), pretty_printed_throughput(_end_size, duration),
std::chrono::duration_cast<std::chrono::milliseconds>(duration).count(), pretty_printed_throughput(_start_size, duration),
_cdata.total_partitions, _cdata.total_keys_written);
return ret;
@@ -776,7 +815,7 @@ private:
};
}
return [this] (const dht::decorated_key& dk) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk);
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks);
};
}
@@ -864,7 +903,51 @@ void compacted_fragments_writer::maybe_abort_compaction() {
}
}
void compacted_fragments_writer::consume_new_partition(const dht::decorated_key& dk) {
void compacted_fragments_writer::stop_current_writer() {
// stop sstable writer being currently used.
_stop_compaction_writer(&*_compaction_writer);
_compaction_writer = std::nullopt;
}
bool compacted_fragments_writer::can_split_large_partition() const {
return _c._can_split_large_partition;
}
void compacted_fragments_writer::track_last_position(position_in_partition_view pos) {
if (can_split_large_partition()) {
_current_partition.last_pos = pos;
}
}
void compacted_fragments_writer::split_large_partition() {
// Closes the active range tombstone if needed, before emitting partition end.
// after_key(last_pos) is used for both closing and re-opening the active tombstone, which
// will result in current fragment storing an inclusive end bound for last pos, and the
// next fragment storing an exclusive start bound for last pos. This is very important
// for not losing information on the range tombstone.
auto after_last_pos = position_in_partition::after_key(*_c.schema(), _current_partition.last_pos.key());
if (_current_partition.current_emitted_tombstone) {
auto rtc = range_tombstone_change(after_last_pos, tombstone{});
_c.log_debug("Closing active tombstone {} with {} for partition {}", _current_partition.current_emitted_tombstone, rtc, *_current_partition.dk);
_compaction_writer->writer.consume(std::move(rtc));
}
_c.log_debug("Splitting large partition {} in order to respect SSTable size limit of {}", *_current_partition.dk, pretty_printed_data_size(_c._max_sstable_size));
// Close partition in current writer, and open it again in a new writer.
do_consume_end_of_partition();
stop_current_writer();
do_consume_new_partition(*_current_partition.dk);
// Replicate partition tombstone to every fragment, allowing the SSTable run reader
// to open a single fragment during the read.
if (_current_partition.tombstone) {
consume(_current_partition.tombstone);
}
if (_current_partition.current_emitted_tombstone) {
_compaction_writer->writer.consume(range_tombstone_change(after_last_pos, _current_partition.current_emitted_tombstone));
}
_current_partition.is_splitting_partition = false;
}
void compacted_fragments_writer::do_consume_new_partition(const dht::decorated_key& dk) {
maybe_abort_compaction();
if (!_compaction_writer) {
_compaction_writer = _create_compaction_writer(dk);
@@ -872,17 +955,55 @@ void compacted_fragments_writer::consume_new_partition(const dht::decorated_key&
_c.on_new_partition();
_compaction_writer->writer.consume_new_partition(dk);
_c._cdata.total_keys_written++;
_unclosed_partition = true;
}
stop_iteration compacted_fragments_writer::consume_end_of_partition() {
auto ret = _compaction_writer->writer.consume_end_of_partition();
stop_iteration compacted_fragments_writer::do_consume_end_of_partition() {
_unclosed_partition = false;
return _compaction_writer->writer.consume_end_of_partition();
}
void compacted_fragments_writer::consume_new_partition(const dht::decorated_key& dk) {
_current_partition = {
.dk = dk,
.tombstone = tombstone(),
.current_emitted_tombstone = tombstone(),
.last_pos = position_in_partition::for_partition_start(),
.is_splitting_partition = false
};
do_consume_new_partition(dk);
_c._cdata.total_keys_written++;
}
void compacted_fragments_writer::consume(tombstone t) {
_current_partition.tombstone = t;
_compaction_writer->writer.consume(t);
}
stop_iteration compacted_fragments_writer::consume(clustering_row&& cr, row_tombstone, bool) {
maybe_abort_compaction();
if (_current_partition.is_splitting_partition) [[unlikely]] {
split_large_partition();
}
track_last_position(cr.position());
auto ret = _compaction_writer->writer.consume(std::move(cr));
if (can_split_large_partition() && ret == stop_iteration::yes) [[unlikely]] {
_current_partition.is_splitting_partition = true;
}
return stop_iteration::no;
}
stop_iteration compacted_fragments_writer::consume(range_tombstone_change&& rtc) {
maybe_abort_compaction();
_current_partition.current_emitted_tombstone = rtc.tombstone();
track_last_position(rtc.position());
return _compaction_writer->writer.consume(std::move(rtc));
}
stop_iteration compacted_fragments_writer::consume_end_of_partition() {
auto ret = do_consume_end_of_partition();
if (ret == stop_iteration::yes) {
// stop sstable writer being currently used.
_stop_compaction_writer(&*_compaction_writer);
_compaction_writer = std::nullopt;
stop_current_writer();
}
return ret;
}
@@ -894,51 +1015,6 @@ void compacted_fragments_writer::consume_end_of_stream() {
}
}
class reshape_compaction : public compaction {
public:
reshape_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: compaction(table_s, std::move(descriptor), cdata) {
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
}
flat_mutation_reader_v2 make_sstable_reader() const override {
return _compacting->make_local_shard_sstable_reader(_schema,
_permit,
query::full_partition_range,
_schema->full_slice(),
_io_priority,
tracing::trace_state_ptr(),
::streamed_mutation::forwarding::no,
::mutation_reader::forwarding::no,
default_read_monitor_generator());
}
std::string_view report_start_desc() const override {
return "Reshaping";
}
std::string_view report_finish_desc() const override {
return "Reshaped";
}
virtual compaction_writer create_compaction_writer(const dht::decorated_key& dk) override {
auto sst = _sstable_creator(this_shard_id());
setup_new_sstable(sst);
sstable_writer_config cfg = make_sstable_writer_config(compaction_type::Reshape);
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
if (writer) {
finish_new_sstable(writer);
}
}
};
class regular_compaction : public compaction {
// keeps track of monitors for input sstable, which are responsible for adjusting backlog as compaction progresses.
mutable compaction_read_monitor_generator _monitor_generator;
@@ -1048,12 +1124,13 @@ private:
}
void update_pending_ranges() {
if (!_sstable_set || _sstable_set->all()->empty() || _cdata.pending_replacements.empty()) { // set can be empty for testing scenario.
auto pending_replacements = std::exchange(_cdata.pending_replacements, {});
if (!_sstable_set || _sstable_set->all()->empty() || pending_replacements.empty()) { // set can be empty for testing scenario.
return;
}
// Releases reference to sstables compacted by this compaction or another, both of which belongs
// to the same column family
for (auto& pending_replacement : _cdata.pending_replacements) {
for (auto& pending_replacement : pending_replacements) {
for (auto& sst : pending_replacement.removed) {
// Set may not contain sstable to be removed because this compaction may have started
// before the creation of that sstable.
@@ -1067,35 +1144,76 @@ private:
}
}
_selector.emplace(_sstable_set->make_incremental_selector());
_cdata.pending_replacements.clear();
}
};
class reshape_compaction : public regular_compaction {
private:
bool has_sstable_replacer() const noexcept {
return bool(_replacer);
}
public:
reshape_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: regular_compaction(table_s, std::move(descriptor), cdata) {
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
}
// Unconditionally enable incremental compaction if the strategy specifies a max output size, e.g. LCS.
virtual bool enable_garbage_collected_sstable_writer() const noexcept override {
return _max_sstable_size != std::numeric_limits<uint64_t>::max() && bool(_replacer);
}
flat_mutation_reader_v2 make_sstable_reader() const override {
return _compacting->make_local_shard_sstable_reader(_schema,
_permit,
query::full_partition_range,
_schema->full_slice(),
_io_priority,
tracing::trace_state_ptr(),
::streamed_mutation::forwarding::no,
::mutation_reader::forwarding::no,
default_read_monitor_generator());
}
std::string_view report_start_desc() const override {
return "Reshaping";
}
std::string_view report_finish_desc() const override {
return "Reshaped";
}
virtual compaction_writer create_compaction_writer(const dht::decorated_key& dk) override {
auto sst = _sstable_creator(this_shard_id());
setup_new_sstable(sst);
sstable_writer_config cfg = make_sstable_writer_config(compaction_type::Reshape);
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
if (writer) {
if (has_sstable_replacer()) {
regular_compaction::stop_sstable_writer(writer);
} else {
finish_new_sstable(writer);
}
}
}
virtual void on_end_of_compaction() override {
if (has_sstable_replacer()) {
regular_compaction::on_end_of_compaction();
}
}
};
class cleanup_compaction final : public regular_compaction {
class incremental_owned_ranges_checker {
const dht::token_range_vector& _sorted_owned_ranges;
mutable dht::token_range_vector::const_iterator _it;
public:
incremental_owned_ranges_checker(const dht::token_range_vector& sorted_owned_ranges)
: _sorted_owned_ranges(sorted_owned_ranges)
, _it(_sorted_owned_ranges.begin()) {
}
// Must be called with increasing token values.
bool belongs_to_current_node(const dht::token& t) const {
// While token T is after a range Rn, advance the iterator.
// iterator will be stopped at a range which either overlaps with T (if T belongs to node),
// or at a range which is after T (if T doesn't belong to this node).
while (_it != _sorted_owned_ranges.end() && _it->after(t, dht::token_comparator())) {
_it++;
}
return _it != _sorted_owned_ranges.end() && _it->contains(t, dht::token_comparator());
}
};
owned_ranges_ptr _owned_ranges;
incremental_owned_ranges_checker _owned_ranges_checker;
mutable dht::incremental_owned_ranges_checker _owned_ranges_checker;
private:
// Called in a seastar thread
dht::partition_range_vector
@@ -1108,21 +1226,8 @@ private:
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}));
// optimize set of potentially overlapping ranges by deoverlapping them.
non_owned_ranges = dht::partition_range::deoverlap(std::move(non_owned_ranges), dht::ring_position_comparator(*_schema));
// subtract *each* owned range from the partition range of *each* sstable*,
// such that we'll be left only with a set of non-owned ranges.
for (auto& owned_range : owned_ranges) {
dht::partition_range_vector new_non_owned_ranges;
for (auto& non_owned_range : non_owned_ranges) {
auto ret = non_owned_range.subtract(owned_range, dht::ring_position_comparator(*_schema));
new_non_owned_ranges.insert(new_non_owned_ranges.end(), ret.begin(), ret.end());
seastar::thread::maybe_yield();
}
non_owned_ranges = std::move(new_non_owned_ranges);
}
return non_owned_ranges;
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
}
protected:
virtual compaction_completion_desc
@@ -1183,9 +1288,9 @@ public:
type,
schema.ks_name(),
schema.cf_name(),
partition_key_to_string(new_key.key(), schema),
new_key.key().with_schema(schema),
new_key,
partition_key_to_string(current_key.key(), schema),
current_key.key().with_schema(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1198,9 +1303,9 @@ public:
type,
schema.ks_name(),
schema.cf_name(),
partition_key_to_string(new_key.key(), schema),
new_key.key().with_schema(schema),
new_key,
partition_key_to_string(current_key.key(), schema),
current_key.key().with_schema(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1218,7 +1323,7 @@ public:
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
partition_key_to_string(key.key(), schema),
key.key().with_schema(schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1230,14 +1335,10 @@ public:
const auto& schema = validator.schema();
const auto& key = validator.previous_partition_key();
clogger.error("[{} compaction {}.{}] Invalid end-of-stream, last partition {} ({}) didn't end with a partition-end fragment{}{}",
type, schema.ks_name(), schema.cf_name(), partition_key_to_string(key.key(), schema), key, action.empty() ? "" : "; ", action);
type, schema.ks_name(), schema.cf_name(), key.key().with_schema(schema), key, action.empty() ? "" : "; ", action);
}
private:
static sstring partition_key_to_string(const partition_key& key, const ::schema& s) {
sstring ret = format("{}", key.with_schema(s));
return utils::utf8::validate((const uint8_t*)ret.data(), ret.size()) ? ret : "<non-utf8-key>";
}
class reader : public flat_mutation_reader_v2::impl {
using skip = bool_class<class skip_tag>;
@@ -1249,17 +1350,20 @@ private:
uint64_t& _validation_errors;
private:
void maybe_abort_scrub() {
void maybe_abort_scrub(std::function<void()> report_error) {
if (_scrub_mode == compaction_type_options::scrub::mode::abort) {
report_error();
throw compaction_aborted_exception(_schema->ks_name(), _schema->cf_name(), "scrub compaction found invalid data");
}
++_validation_errors;
}
void on_unexpected_partition_start(const mutation_fragment_v2& ps) {
maybe_abort_scrub();
report_invalid_partition_start(compaction_type::Scrub, _validator, ps.as_partition_start().key(),
"Rectifying by adding assumed missing partition-end");
auto report_fn = [this, &ps] (std::string_view action = "") {
report_invalid_partition_start(compaction_type::Scrub, _validator, ps.as_partition_start().key(), action);
};
maybe_abort_scrub(report_fn);
report_fn("Rectifying by adding assumed missing partition-end");
auto pe = mutation_fragment_v2(*_schema, _permit, partition_end{});
if (!_validator(pe)) {
@@ -1279,20 +1383,26 @@ private:
}
skip on_invalid_partition(const dht::decorated_key& new_key) {
maybe_abort_scrub();
auto report_fn = [this, &new_key] (std::string_view action = "") {
report_invalid_partition(compaction_type::Scrub, _validator, new_key, action);
};
maybe_abort_scrub(report_fn);
if (_scrub_mode == compaction_type_options::scrub::mode::segregate) {
report_invalid_partition(compaction_type::Scrub, _validator, new_key, "Detected");
report_fn("Detected");
_validator.reset(new_key);
// Let the segregating interposer consumer handle this.
return skip::no;
}
report_invalid_partition(compaction_type::Scrub, _validator, new_key, "Skipping");
report_fn("Skipping");
_skip_to_next_partition = true;
return skip::yes;
}
skip on_invalid_mutation_fragment(const mutation_fragment_v2& mf) {
maybe_abort_scrub();
auto report_fn = [this, &mf] (std::string_view action = "") {
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf, "");
};
maybe_abort_scrub(report_fn);
const auto& key = _validator.previous_partition_key();
@@ -1307,8 +1417,7 @@ private:
// The only case a partition end is invalid is when it comes after
// another partition end, and we can just drop it in that case.
if (!mf.is_end_of_partition() && _scrub_mode == compaction_type_options::scrub::mode::segregate) {
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf,
"Injecting partition start/end to segregate out-of-order fragment");
report_fn("Injecting partition start/end to segregate out-of-order fragment");
push_mutation_fragment(*_schema, _permit, partition_end{});
// We loose the partition tombstone if any, but it will be
@@ -1321,16 +1430,19 @@ private:
return skip::no;
}
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf, "Skipping");
report_fn("Skipping");
return skip::yes;
}
void on_invalid_end_of_stream() {
maybe_abort_scrub();
auto report_fn = [this] (std::string_view action = "") {
report_invalid_end_of_stream(compaction_type::Scrub, _validator, action);
};
maybe_abort_scrub(report_fn);
// Handle missing partition_end
push_mutation_fragment(mutation_fragment_v2(*_schema, _permit, partition_end{}));
report_invalid_end_of_stream(compaction_type::Scrub, _validator, "Rectifying by adding missing partition-end to the end of the stream");
report_fn("Rectifying by adding missing partition-end to the end of the stream");
}
void fill_buffer_from_underlying() {
@@ -1509,13 +1621,13 @@ class resharding_compaction final : public compaction {
uint64_t estimated_partitions = 0;
};
std::vector<estimated_values> _estimation_per_shard;
std::vector<utils::UUID> _run_identifiers;
std::vector<run_id> _run_identifiers;
private:
// return estimated partitions per sstable for a given shard
uint64_t partitions_per_sstable(shard_id s) const {
uint64_t estimated_sstables = std::max(uint64_t(1), uint64_t(ceil(double(_estimation_per_shard[s].estimated_size) / _max_sstable_size)));
return std::min(uint64_t(ceil(double(_estimation_per_shard[s].estimated_partitions) / estimated_sstables)),
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimation_per_shard[s].estimated_partitions));
_table_s.get_compaction_strategy().adjust_partition_estimate(_ms_metadata, _estimation_per_shard[s].estimated_partitions, _schema));
}
public:
resharding_compaction(table_state& table_s, sstables::compaction_descriptor descriptor, compaction_data& cdata)
@@ -1533,7 +1645,7 @@ public:
}
}
for (auto i : boost::irange(0u, smp::count)) {
_run_identifiers[i] = utils::make_random_uuid();
_run_identifiers[i] = run_id::create_random_id();
}
}
@@ -1766,7 +1878,7 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
int64_t min_timestamp = std::numeric_limits<int64_t>::max();
for (auto& sstable : overlapping) {
auto gc_before = sstable->get_gc_before_for_fully_expire(compaction_time);
auto gc_before = sstable->get_gc_before_for_fully_expire(compaction_time, table_s.get_tombstone_gc_state());
if (sstable->get_max_local_deletion_time() >= gc_before) {
min_timestamp = std::min(min_timestamp, sstable->get_stats_metadata().min_timestamp);
}
@@ -1785,7 +1897,7 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
// SStables that do not contain live data is added to list of possibly expired sstables.
for (auto& candidate : compacting) {
auto gc_before = candidate->get_gc_before_for_fully_expire(compaction_time);
auto gc_before = candidate->get_gc_before_for_fully_expire(compaction_time, table_s.get_tombstone_gc_state());
clogger.debug("Checking if candidate of generation {} and max_deletion_time {} is expired, gc_before is {}",
candidate->generation(), candidate->get_stats_metadata().max_local_deletion_time, gc_before);
// A fully expired sstable which has an ancestor undeleted shouldn't be compacted because
@@ -1816,7 +1928,7 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
}
unsigned compaction_descriptor::fan_in() const {
return boost::copy_range<std::unordered_set<utils::UUID>>(sstables | boost::adaptors::transformed(std::mem_fn(&sstables::sstable::run_identifier))).size();
return boost::copy_range<std::unordered_set<run_id>>(sstables | boost::adaptors::transformed(std::mem_fn(&sstables::sstable::run_identifier))).size();
}
uint64_t compaction_descriptor::sstables_size() const {

View File

@@ -80,8 +80,10 @@ struct compaction_data {
}
void stop(sstring reason) {
stop_requested = std::move(reason);
abort.request_abort();
if (!abort.abort_requested()) {
stop_requested = std::move(reason);
abort.request_abort();
}
}
};
@@ -90,12 +92,15 @@ struct compaction_stats {
uint64_t start_size = 0;
uint64_t end_size = 0;
uint64_t validation_errors = 0;
// Bloom filter checks during max purgeable calculation
uint64_t bloom_filter_checks = 0;
compaction_stats& operator+=(const compaction_stats& r) {
ended_at = std::max(ended_at, r.ended_at);
start_size += r.start_size;
end_size += r.end_size;
validation_errors += r.validation_errors;
bloom_filter_checks += r.bloom_filter_checks;
return *this;
}
friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {

View File

@@ -66,7 +66,8 @@ public:
};
compaction_backlog_tracker(std::unique_ptr<impl> impl) : _impl(std::move(impl)) {}
compaction_backlog_tracker(compaction_backlog_tracker&&) = default;
compaction_backlog_tracker(compaction_backlog_tracker&&);
compaction_backlog_tracker& operator=(compaction_backlog_tracker&&) noexcept;
compaction_backlog_tracker(const compaction_backlog_tracker&) = delete;
~compaction_backlog_tracker();
@@ -74,7 +75,7 @@ public:
void replace_sstables(const std::vector<sstables::shared_sstable>& old_ssts, const std::vector<sstables::shared_sstable>& new_ssts);
void register_partially_written_sstable(sstables::shared_sstable sst, backlog_write_progress_manager& wp);
void register_compacting_sstable(sstables::shared_sstable sst, backlog_read_progress_manager& rp);
void transfer_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges = true);
void copy_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges = true) const;
void revert_charges(sstables::shared_sstable sst);
void disable() {

View File

@@ -14,7 +14,7 @@
#include <variant>
#include <seastar/core/smp.hh>
#include <seastar/core/file.hh>
#include "sstables/shared_sstable.hh"
#include "sstables/types_fwd.hh"
#include "sstables/sstable_set.hh"
#include "utils/UUID.hh"
#include "dht/i_partitioner.hh"
@@ -153,8 +153,10 @@ struct compaction_descriptor {
int level;
// Threshold size for sstable(s) to be created.
uint64_t max_sstable_bytes;
// Can split large partitions at clustering boundary.
bool can_split_large_partition = false;
// Run identifier of output sstables.
utils::UUID run_identifier;
sstables::run_id run_identifier;
// The options passed down to the compaction code.
// This also selects the kind of compaction to do.
compaction_type_options options = compaction_type_options::make_regular();
@@ -176,7 +178,7 @@ struct compaction_descriptor {
::io_priority_class io_priority,
int level = default_level,
uint64_t max_sstable_bytes = default_max_sstable_bytes,
utils::UUID run_identifier = utils::make_random_uuid(),
run_id run_identifier = run_id::create_random_id(),
compaction_type_options options = compaction_type_options::make_regular())
: sstables(std::move(sstables))
, level(level)
@@ -192,7 +194,7 @@ struct compaction_descriptor {
: sstables(std::move(sstables))
, level(default_level)
, max_sstable_bytes(default_max_sstable_bytes)
, run_identifier(utils::make_random_uuid())
, run_identifier(run_id::create_random_id())
, options(compaction_type_options::make_regular())
, io_priority(io_priority)
, has_only_fully_expired(has_only_fully_expired)

View File

@@ -7,18 +7,23 @@
*/
#include "compaction_manager.hh"
#include "compaction_descriptor.hh"
#include "compaction_strategy.hh"
#include "compaction_backlog_manager.hh"
#include "sstables/sstables.hh"
#include "sstables/sstables_manager.hh"
#include <memory>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/switch_to.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "sstables/exceptions.hh"
#include "sstables/sstable_directory.hh"
#include "locator/abstract_replication_strategy.hh"
#include "utils/fb_utilities.hh"
#include "utils/UUID_gen.hh"
#include "db/system_keyspace.hh"
#include <cmath>
#include <boost/algorithm/cxx11/any_of.hpp>
#include <boost/range/algorithm/remove_if.hpp>
@@ -75,6 +80,23 @@ public:
_compacting.erase(sst);
}
}
class update_me : public compaction_manager::task::on_replacement {
compacting_sstable_registration& _registration;
public:
update_me(compacting_sstable_registration& registration)
: _registration{registration} {}
void on_removal(const std::vector<sstables::shared_sstable>& sstables) override {
_registration.release_compacting(sstables);
}
void on_addition(const std::vector<sstables::shared_sstable>& sstables) override {
_registration.register_compacting(sstables);
}
};
auto update_on_sstable_replacement() {
return update_me(*this);
}
};
sstables::compaction_data compaction_manager::create_compaction_data() {
@@ -206,7 +228,7 @@ std::vector<sstables::shared_sstable> compaction_manager::get_candidates(compact
candidates.reserve(t.main_sstable_set().all()->size());
// prevents sstables that belongs to a partial run being generated by ongoing compaction from being
// selected for compaction, which could potentially result in wrong behavior.
auto partial_run_identifiers = boost::copy_range<std::unordered_set<utils::UUID>>(_tasks
auto partial_run_identifiers = boost::copy_range<std::unordered_set<sstables::run_id>>(_tasks
| boost::adaptors::filtered(std::mem_fn(&task::generating_output_run))
| boost::adaptors::transformed(std::mem_fn(&task::output_run_id)));
auto& cs = t.get_compaction_strategy();
@@ -276,7 +298,7 @@ compaction_manager::task::task(compaction_manager& mgr, compaction::table_state*
, _description(std::move(desc))
{}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_manager::task> task) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_manager::task> task, throw_if_stopping do_throw_if_stopping) {
_tasks.push_back(task);
auto unregister_task = defer([this, task] {
_tasks.remove(task);
@@ -289,6 +311,9 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_tas
co_return res;
} catch (sstables::compaction_stopped_exception& e) {
cmlog.info("{}: stopped, reason: {}", *task, e.what());
if (do_throw_if_stopping) {
throw;
}
} catch (sstables::compaction_aborted_exception& e) {
cmlog.error("{}: aborted, reason: {}", *task, e.what());
_stats.errors++;
@@ -307,14 +332,14 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_tas
co_return std::nullopt;
}
future<sstables::compaction_result> compaction_manager::task::compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted, can_purge_tombstones can_purge) {
future<sstables::compaction_result> compaction_manager::task::compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, can_purge_tombstones can_purge) {
if (!descriptor.sstables.size()) {
// if there is nothing to compact, just return.
co_return sstables::compaction_result{};
}
bool should_update_history = this->should_update_history(descriptor.options.type());
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), cdata, std::move(release_exhausted), std::move(can_purge));
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), cdata, on_replace, std::move(can_purge));
if (should_update_history) {
co_await update_history(*_compacting_table, res, cdata);
@@ -322,8 +347,11 @@ future<sstables::compaction_result> compaction_manager::task::compact_sstables_a
co_return res;
}
future<sstables::compaction_result> compaction_manager::task::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted, can_purge_tombstones can_purge) {
future<sstables::compaction_result> compaction_manager::task::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
compaction::table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
}
@@ -331,15 +359,26 @@ future<sstables::compaction_result> compaction_manager::task::compact_sstables(s
auto sst = t.make_sstable();
return sst;
};
descriptor.replacer = [this, &t, release_exhausted] (sstables::compaction_completion_desc desc) {
descriptor.replacer = [this, &t, &on_replace, offstrategy] (sstables::compaction_completion_desc desc) {
t.get_compaction_strategy().notify_completion(desc.old_sstables, desc.new_sstables);
_cm.propagate_replacement(t, desc.old_sstables, desc.new_sstables);
// on_replace updates the compacting registration with the old and new
// sstables. while on_compaction_completion() removes the old sstables
// from the table's sstable set, and adds the new ones to the sstable
// set.
// since the regular compactions exclude the sstables in the sstable
// set which are currently being compacted, if we want to ensure the
// exclusive access of compactions to an sstable we should guard it
// with the registration when adding/removing it to/from the sstable
// set. otherwise, the regular compaction would pick it up in the time
// window, where the sstables:
// - are still in the main set
// - are not being compacted.
on_replace.on_addition(desc.new_sstables);
auto old_sstables = desc.old_sstables;
t.on_compaction_completion(std::move(desc), sstables::offstrategy::no).get();
// Calls compaction manager's task for this compaction to release reference to exhausted SSTables.
if (release_exhausted) {
release_exhausted(old_sstables);
}
t.on_compaction_completion(std::move(desc), offstrategy).get();
on_replace.on_removal(old_sstables);
};
co_return co_await sstables::compact_sstables(std::move(descriptor), cdata, t);
@@ -347,8 +386,15 @@ future<sstables::compaction_result> compaction_manager::task::compact_sstables(s
future<> compaction_manager::task::update_history(compaction::table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata) {
auto ended_at = std::chrono::duration_cast<std::chrono::milliseconds>(res.stats.ended_at.time_since_epoch());
co_return co_await t.update_compaction_history(cdata.compaction_uuid, t.schema()->ks_name(), t.schema()->cf_name(), ended_at,
res.stats.start_size, res.stats.end_size);
if (_cm._sys_ks) {
// FIXME: add support to merged_rows. merged_rows is a histogram that
// shows how many sstables each row is merged from. This information
// cannot be accessed until we make combined_reader more generic,
// for example, by adding a reducer method.
auto sys_ks = _cm._sys_ks; // hold pointer on sys_ks
co_await sys_ks->update_compaction_history(cdata.compaction_uuid, t.schema()->ks_name(), t.schema()->cf_name(),
ended_at.count(), res.stats.start_size, res.stats.end_size, std::unordered_map<int32_t, int64_t>{});
}
}
class compaction_manager::major_compaction_task : public compaction_manager::task {
@@ -377,9 +423,7 @@ protected:
sstables::compaction_strategy cs = t->get_compaction_strategy();
sstables::compaction_descriptor descriptor = cs.get_major_compaction_job(*t, _cm.get_candidates(*t));
auto compacting = compacting_sstable_registration(_cm, descriptor.sstables);
auto release_exhausted = [&compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting.release_compacting(exhausted_sstables);
};
auto on_replace = compacting.update_on_sstable_replacement();
setup_new_compaction(descriptor.run_identifier);
cmlog.info0("User initiated compaction started on behalf of {}.{}", t->schema()->ks_name(), t->schema()->cf_name());
@@ -391,7 +435,7 @@ protected:
// the exclusive lock can be freed to let regular compaction run in parallel to major
lock_holder.return_all();
co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, std::move(release_exhausted));
co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace);
finish_compaction();
@@ -438,12 +482,12 @@ protected:
}
};
future<> compaction_manager::run_custom_job(compaction::table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&)> job) {
future<> compaction_manager::run_custom_job(compaction::table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&)> job, throw_if_stopping do_throw_if_stopping) {
if (_state != state::enabled) {
return make_ready_future<>();
}
return perform_task(make_shared<custom_compaction_task>(*this, &t, type, desc, std::move(job))).discard_result();
return perform_task(make_shared<custom_compaction_task>(*this, &t, type, desc, std::move(job)), do_throw_if_stopping).discard_result();
}
future<> compaction_manager::update_static_shares(float static_shares) {
@@ -611,7 +655,7 @@ future<semaphore_units<named_semaphore_exception_factory>> compaction_manager::t
});
}
void compaction_manager::task::setup_new_compaction(utils::UUID output_run_id) {
void compaction_manager::task::setup_new_compaction(sstables::run_id output_run_id) {
_compaction_data = create_compaction_data();
_output_run_identifier = output_run_id;
switch_state(state::active);
@@ -619,7 +663,7 @@ void compaction_manager::task::setup_new_compaction(utils::UUID output_run_id) {
void compaction_manager::task::finish_compaction(state finish_state) noexcept {
switch_state(finish_state);
_output_run_identifier = utils::null_uuid();
_output_run_identifier = sstables::run_id::create_null_id();
if (finish_state != state::failed) {
_compaction_retry.reset();
}
@@ -637,6 +681,7 @@ sstables::compaction_stopped_exception compaction_manager::task::make_compaction
compaction_manager::compaction_manager(config cfg, abort_source& as)
: _cfg(std::move(cfg))
, _compaction_submission_timer(compaction_sg().cpu, compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), static_shares(), [this] () -> float {
_last_backlog = backlog();
auto b = _last_backlog / available_memory();
@@ -657,6 +702,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as)
, _update_compaction_static_shares_action([this] { return update_static_shares(static_shares()); })
, _compaction_static_shares_observer(_cfg.static_shares.observe(_update_compaction_static_shares_action.make_observer()))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(&_repair_history_maps)
{
register_metrics();
// Bandwidth throttling is node-wide, updater is needed on single shard
@@ -670,12 +716,14 @@ compaction_manager::compaction_manager(config cfg, abort_source& as)
compaction_manager::compaction_manager()
: _cfg(config{ .available_memory = 1 })
, _compaction_submission_timer(compaction_sg().cpu, compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), 1, [] () -> float { return 1.0; }))
, _backlog_manager(_compaction_controller)
, _throughput_updater(serialized_action([this] { return update_throughput(throughput_mbs()); }))
, _update_compaction_static_shares_action([] { return make_ready_future<>(); })
, _compaction_static_shares_observer(_cfg.static_shares.observe(_update_compaction_static_shares_action.make_observer()))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(&_repair_history_maps)
{
// No metric registration because this constructor is supposed to be used only by the testing
// infrastructure.
@@ -726,38 +774,46 @@ void compaction_manager::register_metrics() {
void compaction_manager::enable() {
assert(_state == state::none || _state == state::disabled);
_state = state::enabled;
_compaction_submission_timer.arm(periodic_compaction_submission_interval());
postponed_compactions_reevaluation();
_compaction_submission_timer.arm_periodic(periodic_compaction_submission_interval());
_waiting_reevalution = postponed_compactions_reevaluation();
}
std::function<void()> compaction_manager::compaction_submission_callback() {
return [this] () mutable {
for (auto& e: _compaction_state) {
submit(*e.first);
postpone_compaction_for_table(e.first);
}
reevaluate_postponed_compactions();
};
}
void compaction_manager::postponed_compactions_reevaluation() {
_waiting_reevalution = repeat([this] {
return _postponed_reevaluation.wait().then([this] {
if (_state != state::enabled) {
_postponed.clear();
return stop_iteration::yes;
}
auto postponed = std::move(_postponed);
try {
for (auto& t : postponed) {
auto s = t->schema();
cmlog.debug("resubmitting postponed compaction for table {}.{} [{}]", s->ks_name(), s->cf_name(), fmt::ptr(t));
submit(*t);
future<> compaction_manager::postponed_compactions_reevaluation() {
while (true) {
co_await _postponed_reevaluation.when();
if (_state != state::enabled) {
_postponed.clear();
co_return;
}
// A task_state being reevaluated can re-insert itself into postponed list, which is the reason
// for moving the list to be processed into a local.
auto postponed = std::exchange(_postponed, {});
try {
for (auto it = postponed.begin(); it != postponed.end();) {
compaction::table_state* t = *it;
it = postponed.erase(it);
// skip reevaluation of a table_state that became invalid post its removal
if (!_compaction_state.contains(t)) {
continue;
}
} catch (...) {
_postponed = std::move(postponed);
auto s = t->schema();
cmlog.debug("resubmitting postponed compaction for table {}.{} [{}]", s->ks_name(), s->cf_name(), fmt::ptr(t));
submit(*t);
co_await coroutine::maybe_yield();
}
return stop_iteration::no;
});
});
} catch (...) {
_postponed.insert(postponed.begin(), postponed.end());
}
}
}
void compaction_manager::reevaluate_postponed_compactions() noexcept {
@@ -789,23 +845,27 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<task>> tasks, sst
});
}
future<> compaction_manager::stop_ongoing_compactions(sstring reason, compaction::table_state* t, std::optional<sstables::compaction_type> type_opt) {
auto ongoing_compactions = get_compactions(t).size();
auto tasks = boost::copy_range<std::vector<shared_ptr<task>>>(_tasks | boost::adaptors::filtered([t, type_opt] (auto& task) {
return (!t || task->compacting_table() == t) && (!type_opt || task->type() == *type_opt);
}));
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
std::string scope = "";
if (t) {
scope = fmt::format(" for table {}.{}", t->schema()->ks_name(), t->schema()->cf_name());
future<> compaction_manager::stop_ongoing_compactions(sstring reason, compaction::table_state* t, std::optional<sstables::compaction_type> type_opt) noexcept {
try {
auto ongoing_compactions = get_compactions(t).size();
auto tasks = boost::copy_range<std::vector<shared_ptr<task>>>(_tasks | boost::adaptors::filtered([t, type_opt] (auto& task) {
return (!t || task->compacting_table() == t) && (!type_opt || task->type() == *type_opt);
}));
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
std::string scope = "";
if (t) {
scope = fmt::format(" for table {}.{}", t->schema()->ks_name(), t->schema()->cf_name());
}
if (type_opt) {
scope += fmt::format(" {} type={}", scope.size() ? "and" : "for", *type_opt);
}
cmlog.log(level, "Stopping {} tasks for {} ongoing compactions{} due to {}", tasks.size(), ongoing_compactions, scope, reason);
}
if (type_opt) {
scope += fmt::format(" {} type={}", scope.size() ? "and" : "for", *type_opt);
}
cmlog.log(level, "Stopping {} tasks for {} ongoing compactions{} due to {}", tasks.size(), ongoing_compactions, scope, reason);
return stop_tasks(std::move(tasks), std::move(reason));
} catch (...) {
return current_exception_as_future<>();
}
return stop_tasks(std::move(tasks), std::move(reason));
}
future<> compaction_manager::drain() {
@@ -842,14 +902,31 @@ future<> compaction_manager::really_do_stop() {
cmlog.info("Stopped");
}
template <typename Ex>
requires std::is_base_of_v<std::exception, Ex> &&
requires (const Ex& ex) {
{ ex.code() } noexcept -> std::same_as<const std::error_code&>;
}
auto swallow_enospc(const Ex& ex) noexcept {
if (ex.code().value() != ENOSPC) {
return make_exception_future<>(std::make_exception_ptr(ex));
}
cmlog.warn("Got ENOSPC on stop, ignoring...");
return make_ready_future<>();
}
void compaction_manager::do_stop() noexcept {
if (_state == state::none || _state == state::stopped) {
if (_state == state::none || _stop_future) {
return;
}
try {
_state = state::stopped;
_stop_future = really_do_stop();
_stop_future = really_do_stop()
.handle_exception_type([] (const std::system_error& ex) { return swallow_enospc(ex); })
.handle_exception_type([] (const storage_io_error& ex) { return swallow_enospc(ex); })
;
} catch (...) {
cmlog.error("Failed to stop the manager: {}", std::current_exception());
}
@@ -941,9 +1018,7 @@ protected:
}
auto compacting = compacting_sstable_registration(_cm, descriptor.sstables);
auto weight_r = compaction_weight_registration(&_cm, weight);
auto release_exhausted = [&compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting.release_compacting(exhausted_sstables);
};
auto on_replace = compacting.update_on_sstable_replacement();
cmlog.debug("Accepted compaction job: task={} ({} sstable(s)) of weight {} for {}.{}",
fmt::ptr(this), descriptor.sstables.size(), weight, t.schema()->ks_name(), t.schema()->cf_name());
@@ -952,7 +1027,7 @@ protected:
try {
bool should_update_history = this->should_update_history(descriptor.options.type());
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, std::move(release_exhausted));
sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, on_replace);
finish_compaction();
if (should_update_history) {
// update_history can take a long time compared to
@@ -993,7 +1068,7 @@ void compaction_manager::submit(compaction::table_state& t) {
// OK to drop future.
// waited via task->stop()
(void)perform_task(make_shared<regular_compaction_task>(*this, t));
(void)perform_task(make_shared<regular_compaction_task>(*this, t)).then_wrapped([] (auto f) { f.ignore_ready_future(); });
}
bool compaction_manager::can_perform_regular_compaction(compaction::table_state& t) {
@@ -1010,11 +1085,11 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction::
auto num_runs_for_compaction = [&, this] {
auto& cs = t.get_compaction_strategy();
auto desc = cs.get_sstables_for_compaction(t, get_strategy_control(), get_candidates(t));
return boost::copy_range<std::unordered_set<utils::UUID>>(
return boost::copy_range<std::unordered_set<sstables::run_id>>(
desc.sstables
| boost::adaptors::transformed(std::mem_fn(&sstables::sstable::run_identifier))).size();
};
const auto threshold = std::max(schema->max_compaction_threshold(), 32);
const auto threshold = size_t(std::max(schema->max_compaction_threshold(), 32));
auto count = num_runs_for_compaction();
if (count <= threshold) {
cmlog.trace("No need to wait for sstable count reduction in {}.{}: {} <= {}",
@@ -1050,83 +1125,66 @@ public:
bool performed() const noexcept {
return _performed;
}
private:
future<> run_offstrategy_compaction(sstables::compaction_data& cdata) {
// This procedure will reshape sstables in maintenance set until it's ready for
// integration into main set.
// It may require N reshape rounds before the set satisfies the strategy invariant.
// This procedure also only updates maintenance set at the end, on success.
// Otherwise, some overlapping could be introduced in the set after each reshape
// round, progressively degrading read amplification until integration happens.
// The drawback of this approach is the 2x space requirement as the old sstables
// will only be deleted at the end. The impact of this space requirement is reduced
// by the fact that off-strategy is serialized across all tables, meaning that the
// actual requirement is the size of the largest table's maintenance set.
// Incrementally reshape the SSTables in maintenance set. The output of each reshape
// round is merged into the main set. The common case is that off-strategy input
// is mostly disjoint, e.g. repair-based node ops, then all the input will be
// reshaped in a single round. The incremental approach allows us to be space
// efficient (avoiding a 100% overhead) as we will incrementally replace input
// SSTables from maintenance set by output ones into main set.
compaction::table_state& t = *_compacting_table;
const auto& maintenance_sstables = t.maintenance_sstable_set();
const auto old_sstables = boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_sstables.all());
std::vector<sstables::shared_sstable> reshape_candidates = old_sstables;
std::vector<sstables::shared_sstable> sstables_to_remove;
std::unordered_set<sstables::shared_sstable> new_unused_sstables;
auto cleanup_new_unused_sstables_on_failure = defer([&new_unused_sstables] {
for (auto& sst : new_unused_sstables) {
sst->mark_for_deletion();
}
});
// Filter out sstables that require view building, to avoid a race between off-strategy
// and view building. Refs: #11882
auto get_reshape_candidates = [&t] () {
auto maintenance_ssts = t.maintenance_sstable_set().all();
return boost::copy_range<std::vector<sstables::shared_sstable>>(*maintenance_ssts
| boost::adaptors::filtered([](const sstables::shared_sstable& sst) {
return !sst->requires_view_building();
}));
};
auto get_next_job = [&] () -> std::optional<sstables::compaction_descriptor> {
auto& iop = service::get_local_streaming_priority(); // run reshape in maintenance mode
auto desc = t.get_compaction_strategy().get_reshaping_job(reshape_candidates, t.schema(), iop, sstables::reshape_mode::strict);
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), iop, sstables::reshape_mode::strict);
return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
};
std::exception_ptr err;
while (auto desc = get_next_job()) {
desc->creator = [this, &new_unused_sstables, &t] (shard_id dummy) {
auto sst = t.make_sstable();
new_unused_sstables.insert(sst);
return sst;
};
auto input = boost::copy_range<std::unordered_set<sstables::shared_sstable>>(desc->sstables);
auto compacting = compacting_sstable_registration(_cm, desc->sstables);
auto on_replace = compacting.update_on_sstable_replacement();
auto ret = co_await sstables::compact_sstables(std::move(*desc), cdata, t);
_performed = true;
// update list of reshape candidates without input but with output added to it
auto it = boost::remove_if(reshape_candidates, [&] (auto& s) { return input.contains(s); });
reshape_candidates.erase(it, reshape_candidates.end());
std::move(ret.new_sstables.begin(), ret.new_sstables.end(), std::back_inserter(reshape_candidates));
// If compaction strategy is unable to reshape input data in a single round, it may happen that a SSTable A
// created in round 1 will be compacted in a next round producing SSTable B. As SSTable A is no longer needed,
// it can be removed immediately. Let's remove all such SSTables immediately to reduce off-strategy space requirement.
// Input SSTables from maintenance set can only be removed later, as SSTable sets are only updated on completion.
auto can_remove_now = [&] (const sstables::shared_sstable& s) { return new_unused_sstables.contains(s); };
for (auto&& sst : input) {
if (can_remove_now(sst)) {
co_await sst->unlink();
new_unused_sstables.erase(std::move(sst));
} else {
sstables_to_remove.push_back(std::move(sst));
}
try {
sstables::compaction_result _ = co_await compact_sstables(std::move(*desc), _compaction_data, on_replace,
compaction_manager::can_purge_tombstones::no,
sstables::offstrategy::yes);
} catch (sstables::compaction_stopped_exception&) {
// If off-strategy compaction stopped on user request, let's not discard the partial work.
// Therefore, both un-reshaped and reshaped data will be integrated into main set, allowing
// regular compaction to continue from where off-strategy left off.
err = std::current_exception();
break;
}
_performed = true;
}
// at this moment reshape_candidates contains a set of sstables ready for integration into main set
auto completion_desc = sstables::compaction_completion_desc{
.old_sstables = std::move(old_sstables),
.new_sstables = std::move(reshape_candidates)
};
co_await t.on_compaction_completion(std::move(completion_desc), sstables::offstrategy::yes);
// There might be some remaining sstables in maintenance set that didn't require reshape, or the
// user has aborted off-strategy. So we can only integrate them into the main set, such that
// they become candidates for regular compaction. We cannot hold them forever in maintenance set,
// as that causes read and space amplification issues.
if (auto sstables = get_reshape_candidates(); sstables.size()) {
auto completion_desc = sstables::compaction_completion_desc{
.old_sstables = sstables, // removes from maintenance set.
.new_sstables = sstables, // adds into main set.
};
co_await t.on_compaction_completion(std::move(completion_desc), sstables::offstrategy::yes);
}
cleanup_new_unused_sstables_on_failure.cancel();
// By marking input sstables for deletion instead, the ones which require view building will stay in the staging
// directory until they're moved to the main dir when the time comes. Also, that allows view building to resume
// on restart if there's a crash midway.
for (auto& sst : sstables_to_remove) {
sst->mark_for_deletion();
if (err) {
co_await coroutine::return_exception_ptr(std::move(err));
}
}
protected:
@@ -1147,9 +1205,11 @@ protected:
std::exception_ptr ex;
try {
compaction::table_state& t = *_compacting_table;
auto maintenance_sstables = t.maintenance_sstable_set().all();
cmlog.info("Starting off-strategy compaction for {}.{}, {} candidates were found",
t.schema()->ks_name(), t.schema()->cf_name(), maintenance_sstables->size());
{
auto maintenance_sstables = t.maintenance_sstable_set().all();
cmlog.info("Starting off-strategy compaction for {}.{}, {} candidates were found",
t.schema()->ks_name(), t.schema()->cf_name(), maintenance_sstables->size());
}
co_await run_offstrategy_compaction(_compaction_data);
finish_compaction();
cmlog.info("Done with off-strategy compaction for {}.{}", t.schema()->ks_name(), t.schema()->cf_name());
@@ -1222,9 +1282,7 @@ private:
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, _options);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
auto release_exhausted = [this] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
_compacting.release_compacting(exhausted_sstables);
};
auto on_replace = _compacting.update_on_sstable_replacement();
setup_new_compaction(descriptor.run_identifier);
@@ -1233,7 +1291,7 @@ private:
std::exception_ptr ex;
try {
sstables::compaction_result res = co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, std::move(release_exhausted), _can_purge);
sstables::compaction_result res = co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace, _can_purge);
finish_compaction();
_cm.reevaluate_postponed_compactions();
co_return res; // done with current sstable
@@ -1390,14 +1448,26 @@ protected:
co_return std::nullopt;
}
private:
// Releases reference to cleaned files such that respective used disk space can be freed.
void release_exhausted(std::vector<sstables::shared_sstable> exhausted_sstables) {
_compacting.release_compacting(exhausted_sstables);
}
future<> run_cleanup_job(sstables::compaction_descriptor descriptor) {
co_await coroutine::switch_to(_cm.compaction_sg().cpu);
// Releases reference to cleaned files such that respective used disk space can be freed.
using update_registration = compacting_sstable_registration::update_me;
class release_exhausted : public update_registration {
sstables::compaction_descriptor& _desc;
public:
release_exhausted(compacting_sstable_registration& registration, sstables::compaction_descriptor& desc)
: update_registration{registration}
, _desc{desc} {}
void on_removal(const std::vector<sstables::shared_sstable>& sstables) override {
auto exhausted = boost::copy_range<std::unordered_set<sstables::shared_sstable>>(sstables);
std::erase_if(_desc.sstables, [&] (const sstables::shared_sstable& sst) {
return exhausted.contains(sst);
});
update_registration::on_removal(sstables);
}
};
release_exhausted on_replace{_compacting, descriptor};
for (;;) {
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_cm._compaction_controller.backlog_of_shares(200), _cm.available_memory()));
_cm.register_backlog_tracker(user_initiated);
@@ -1405,8 +1475,7 @@ private:
std::exception_ptr ex;
try {
setup_new_compaction(descriptor.run_identifier);
co_await compact_sstables_and_update_history(descriptor, _compaction_data,
std::bind(&cleanup_sstables_compaction_task::release_exhausted, this, std::placeholders::_1));
co_await compact_sstables_and_update_history(descriptor, _compaction_data, on_replace);
finish_compaction();
_cm.reevaluate_postponed_compactions();
co_return; // done with current job
@@ -1426,10 +1495,8 @@ private:
bool needs_cleanup(const sstables::shared_sstable& sst,
const dht::token_range_vector& sorted_owned_ranges,
schema_ptr s) {
auto first = sst->get_first_partition_key();
auto last = sst->get_last_partition_key();
auto first_token = dht::get_token(*s, first);
auto last_token = dht::get_token(*s, last);
auto first_token = sst->get_first_decorated_key().token();
auto last_token = sst->get_last_decorated_key().token();
dht::token_range sst_token_range = dht::token_range::make(first_token, last_token);
auto r = std::lower_bound(sorted_owned_ranges.begin(), sorted_owned_ranges.end(), first_token,
@@ -1529,31 +1596,35 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst
}, can_purge_tombstones::no);
}
compaction_manager::compaction_state::compaction_state(table_state& t)
: backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())
{
}
void compaction_manager::add(compaction::table_state& t) {
auto [_, inserted] = _compaction_state.insert({&t, compaction_state{}});
auto [_, inserted] = _compaction_state.try_emplace(&t, t);
if (!inserted) {
auto s = t.schema();
on_internal_error(cmlog, format("compaction_state for table {}.{} [{}] already exists", s->ks_name(), s->cf_name(), fmt::ptr(&t)));
}
}
future<> compaction_manager::remove(compaction::table_state& t) {
auto handle = _compaction_state.extract(&t);
future<> compaction_manager::remove(compaction::table_state& t) noexcept {
auto& c_state = get_compaction_state(&t);
if (!handle.empty()) {
auto& c_state = handle.mapped();
// We need to guarantee that a task being stopped will not retry to compact
// a table being removed.
// The requirement above is provided by stop_ongoing_compactions().
_postponed.erase(&t);
// We need to guarantee that a task being stopped will not retry to compact
// a table being removed.
// The requirement above is provided by stop_ongoing_compactions().
_postponed.erase(&t);
// Wait for all compaction tasks running under gate to terminate
// and prevent new tasks from entering the gate.
co_await seastar::when_all_succeed(stop_ongoing_compactions("table removal", &t), c_state.gate.close()).discard_result();
// Wait for the termination of an ongoing compaction on table T, if any.
co_await stop_ongoing_compactions("table removal", &t);
c_state.backlog_tracker.disable();
_compaction_state.erase(&t);
// Wait for all functions running under gate to terminate.
co_await c_state.gate.close();
}
#ifdef DEBUG
auto found = false;
sstring msg;
@@ -1645,6 +1716,14 @@ compaction::strategy_control& compaction_manager::get_strategy_control() const n
return *_strategy_control;
}
void compaction_manager::plug_system_keyspace(db::system_keyspace& sys_ks) noexcept {
_sys_ks = sys_ks.shared_from_this();
}
void compaction_manager::unplug_system_keyspace() noexcept {
_sys_ks = nullptr;
}
double compaction_backlog_tracker::backlog() const {
return disabled() ? compaction_controller::disable_backlog : _impl->backlog(_ongoing_writes, _ongoing_compactions);
}
@@ -1704,7 +1783,7 @@ void compaction_backlog_tracker::register_compacting_sstable(sstables::shared_ss
}
}
void compaction_backlog_tracker::transfer_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges) {
void compaction_backlog_tracker::copy_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges) const {
for (auto&& w : _ongoing_writes) {
new_bt.register_partially_written_sstable(w.first, *w.second);
}
@@ -1714,8 +1793,6 @@ void compaction_backlog_tracker::transfer_ongoing_charges(compaction_backlog_tra
new_bt.register_compacting_sstable(w.first, *w.second);
}
}
_ongoing_writes = {};
_ongoing_compactions = {};
}
void compaction_backlog_tracker::revert_charges(sstables::shared_sstable sst) {
@@ -1723,6 +1800,26 @@ void compaction_backlog_tracker::revert_charges(sstables::shared_sstable sst) {
_ongoing_compactions.erase(sst);
}
compaction_backlog_tracker::compaction_backlog_tracker(compaction_backlog_tracker&& other)
: _impl(std::move(other._impl))
, _ongoing_writes(std::move(other._ongoing_writes))
, _ongoing_compactions(std::move(other._ongoing_compactions))
, _manager(std::exchange(other._manager, nullptr)) {
}
compaction_backlog_tracker&
compaction_backlog_tracker::operator=(compaction_backlog_tracker&& x) noexcept {
if (this != &x) {
if (auto manager = std::exchange(_manager, x._manager)) {
manager->remove_backlog_tracker(this);
}
_impl = std::move(x._impl);
_ongoing_writes = std::move(x._ongoing_writes);
_ongoing_compactions = std::move(x._ongoing_compactions);
}
return *this;
}
compaction_backlog_tracker::~compaction_backlog_tracker() {
if (_manager) {
_manager->remove_backlog_tracker(this);
@@ -1760,3 +1857,14 @@ compaction_backlog_manager::~compaction_backlog_manager() {
tracker->_manager = nullptr;
}
}
void compaction_manager::register_backlog_tracker(compaction::table_state& t, compaction_backlog_tracker new_backlog_tracker) {
auto& cs = get_compaction_state(&t);
cs.backlog_tracker = std::move(new_backlog_tracker);
register_backlog_tracker(cs.backlog_tracker);
}
compaction_backlog_tracker& compaction_manager::get_backlog_tracker(compaction::table_state& t) {
auto& cs = get_compaction_state(&t);
return cs.backlog_tracker;
}

View File

@@ -8,6 +8,9 @@
#pragma once
#include <boost/icl/interval.hpp>
#include <boost/icl/interval_map.hpp>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/shared_ptr.hh>
@@ -29,13 +32,26 @@
#include "compaction.hh"
#include "compaction_weight_registration.hh"
#include "compaction_backlog_manager.hh"
#include "compaction/compaction_descriptor.hh"
#include "strategy_control.hh"
#include "backlog_controller.hh"
#include "seastarx.hh"
#include "sstables/exceptions.hh"
#include "tombstone_gc.hh"
namespace db {
class system_keyspace;
}
class compacting_sstable_registration;
class repair_history_map {
public:
boost::icl::interval_map<dht::token, gc_clock::time_point, boost::icl::partial_absorber, std::less, boost::icl::inplace_max> map;
};
using throw_if_stopping = bool_class<struct throw_if_stopping_tag>;
// Compaction manager provides facilities to submit and track compaction jobs on
// behalf of existing tables.
class compaction_manager {
@@ -70,8 +86,10 @@ private:
// Signaled whenever a compaction task completes.
condition_variable compaction_done;
compaction_state() = default;
compaction_state(compaction_state&&) = default;
compaction_backlog_tracker backlog_tracker;
explicit compaction_state(table_state& t);
compaction_state(compaction_state&&) = delete;
~compaction_state();
bool compaction_disabled() const noexcept {
@@ -110,7 +128,7 @@ public:
shared_future<compaction_stats_opt> _compaction_done = make_ready_future<compaction_stats_opt>();
exponential_backoff_retry _compaction_retry = exponential_backoff_retry(std::chrono::seconds(5), std::chrono::seconds(300));
sstables::compaction_type _type;
utils::UUID _output_run_identifier;
sstables::run_id _output_run_identifier;
gate::holder _gate_holder;
sstring _description;
@@ -122,11 +140,20 @@ public:
virtual ~task();
// called when a compaction replaces the exhausted sstables with the new set
struct on_replacement {
virtual ~on_replacement() {}
// called after the replacement completes
// @param sstables the old sstable which are replaced in this replacement
virtual void on_removal(const std::vector<sstables::shared_sstable>& sstables) = 0;
// called before the replacement happens
// @param sstables the new sstables to be added to the table's sstable set
virtual void on_addition(const std::vector<sstables::shared_sstable>& sstables) = 0;
};
protected:
virtual future<compaction_stats_opt> do_run() = 0;
using throw_if_stopping = bool_class<struct throw_if_stopping_tag>;
state switch_state(state new_state);
future<semaphore_units<named_semaphore_exception_factory>> acquire_semaphore(named_semaphore& sem, size_t units = 1);
@@ -134,7 +161,7 @@ public:
// Return true if the task isn't stopped
// and the compaction manager allows proceeding.
inline bool can_proceed(throw_if_stopping do_throw_if_stopping = throw_if_stopping::no) const;
void setup_new_compaction(utils::UUID output_run_id = utils::null_uuid());
void setup_new_compaction(sstables::run_id output_run_id = sstables::run_id::create_null_id());
void finish_compaction(state finish_state = state::done) noexcept;
// Compaction manager stop itself if it finds an storage I/O error which results in
@@ -143,12 +170,10 @@ public:
// otherwise, returns stop_iteration::no after sleep for exponential retry.
future<stop_iteration> maybe_retry(std::exception_ptr err, bool throw_on_abort = false);
// Compacts set of SSTables according to the descriptor.
using release_exhausted_func_t = std::function<void(const std::vector<sstables::shared_sstable>& exhausted_sstables)>;
future<sstables::compaction_result> compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted,
can_purge_tombstones can_purge = can_purge_tombstones::yes);
future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, release_exhausted_func_t release_exhausted,
future<sstables::compaction_result> compact_sstables_and_update_history(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,
can_purge_tombstones can_purge = can_purge_tombstones::yes);
future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstables::offstrategy offstrategy = sstables::offstrategy::no);
future<> update_history(compaction::table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata);
bool should_update_history(sstables::compaction_type ct) {
return ct == sstables::compaction_type::Compaction;
@@ -179,7 +204,7 @@ public:
bool generating_output_run() const noexcept {
return compaction_running() && _output_run_identifier;
}
const utils::UUID& output_run_id() const noexcept {
const sstables::run_id& output_run_id() const noexcept {
return _output_run_identifier;
}
@@ -276,13 +301,15 @@ private:
// being picked more than once.
seastar::named_semaphore _off_strategy_sem = {1, named_semaphore_exception_factory{"off-strategy compaction"}};
seastar::shared_ptr<db::system_keyspace> _sys_ks;
std::function<void()> compaction_submission_callback();
// all registered tables are reevaluated at a constant interval.
// Submission is a NO-OP when there's nothing to do, so it's fine to call it regularly.
timer<lowres_clock> _compaction_submission_timer = timer<lowres_clock>(compaction_submission_callback());
static constexpr std::chrono::seconds periodic_compaction_submission_interval() { return std::chrono::seconds(3600); }
config _cfg;
timer<lowres_clock> _compaction_submission_timer;
compaction_controller _compaction_controller;
compaction_backlog_manager _backlog_manager;
optimized_optional<abort_source::subscription> _early_abort_subscription;
@@ -294,8 +321,11 @@ private:
class strategy_control;
std::unique_ptr<strategy_control> _strategy_control;
per_table_history_maps _repair_history_maps;
tombstone_gc_state _tombstone_gc_state;
private:
future<compaction_stats_opt> perform_task(shared_ptr<task>);
future<compaction_stats_opt> perform_task(shared_ptr<task>, throw_if_stopping do_throw_if_stopping = throw_if_stopping::no);
future<> stop_tasks(std::vector<shared_ptr<task>> tasks, sstring reason);
future<> update_throughput(uint32_t value_mbs);
@@ -330,7 +360,7 @@ private:
// table still exists and compaction is not disabled for the table.
inline bool can_proceed(compaction::table_state* t) const;
void postponed_compactions_reevaluation();
future<> postponed_compactions_reevaluation();
void reevaluate_postponed_compactions() noexcept;
// Postpone compaction for a table that couldn't be executed due to ongoing
// similar-sized compaction.
@@ -440,7 +470,7 @@ public:
// parameter type is the compaction type the operation can most closely be
// associated with, use compaction_type::Compaction, if none apply.
// parameter job is a function that will carry the operation
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&)> job);
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&)> job, throw_if_stopping do_throw_if_stopping);
class compaction_reenabler {
compaction_manager& _cm;
@@ -470,6 +500,9 @@ public:
// Run a function with compaction temporarily disabled for a table T.
future<> run_with_compaction_disabled(compaction::table_state& t, std::function<future<> ()> func);
void plug_system_keyspace(db::system_keyspace& sys_ks) noexcept;
void unplug_system_keyspace() noexcept;
// Adds a table to the compaction manager.
// Creates a compaction_state structure that can be used for submitting
// compaction jobs of all types.
@@ -477,7 +510,7 @@ public:
// Remove a table from the compaction manager.
// Cancel requests on table and wait for possible ongoing compactions.
future<> remove(compaction::table_state& t);
future<> remove(compaction::table_state& t) noexcept;
const stats& get_stats() const {
return _stats;
@@ -494,7 +527,7 @@ public:
future<> stop_compaction(sstring type, compaction::table_state* table = nullptr);
// Stops ongoing compaction of a given table and/or compaction_type.
future<> stop_ongoing_compactions(sstring reason, compaction::table_state* t = nullptr, std::optional<sstables::compaction_type> type_opt = {});
future<> stop_ongoing_compactions(sstring reason, compaction::table_state* t = nullptr, std::optional<sstables::compaction_type> type_opt = {}) noexcept;
double backlog() {
return _backlog_manager.backlog();
@@ -503,11 +536,22 @@ public:
void register_backlog_tracker(compaction_backlog_tracker& backlog_tracker) {
_backlog_manager.register_backlog_tracker(backlog_tracker);
}
void register_backlog_tracker(compaction::table_state& t, compaction_backlog_tracker new_backlog_tracker);
compaction_backlog_tracker& get_backlog_tracker(compaction::table_state& t);
static sstables::compaction_data create_compaction_data();
compaction::strategy_control& get_strategy_control() const noexcept;
tombstone_gc_state& get_tombstone_gc_state() noexcept {
return _tombstone_gc_state;
};
const tombstone_gc_state& get_tombstone_gc_state() const noexcept {
return _tombstone_gc_state;
};
friend class compacting_sstable_registration;
friend class compaction_weight_registration;
friend class compaction_manager_test;

View File

@@ -50,7 +50,7 @@ std::vector<compaction_descriptor> compaction_strategy_impl::get_cleanup_compact
}));
}
bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time) {
bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const tombstone_gc_state& gc_state) {
if (_disable_tombstone_compaction) {
return false;
}
@@ -61,11 +61,11 @@ bool compaction_strategy_impl::worth_dropping_tombstones(const shared_sstable& s
if (db_clock::now()-_tombstone_compaction_interval < sst->data_file_write_time()) {
return false;
}
auto gc_before = sst->get_gc_before_for_drop_estimation(compaction_time);
auto gc_before = sst->get_gc_before_for_drop_estimation(compaction_time, gc_state);
return sst->estimate_droppable_tombstone_ratio(gc_before) >= _tombstone_threshold;
}
uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) {
return partition_estimate;
}
@@ -409,7 +409,9 @@ public:
l0_old_ssts.push_back(std::move(sst));
}
}
_l0_scts.replace_sstables(std::move(l0_old_ssts), std::move(l0_new_ssts));
if (l0_old_ssts.size() || l0_new_ssts.size()) {
_l0_scts.replace_sstables(std::move(l0_old_ssts), std::move(l0_new_ssts));
}
}
};
@@ -427,14 +429,6 @@ struct null_backlog_tracker final : public compaction_backlog_tracker::impl {
virtual void replace_sstables(std::vector<sstables::shared_sstable> old_ssts, std::vector<sstables::shared_sstable> new_ssts) override {}
};
// Just so that if we have more than one CF with NullStrategy, we don't create a lot
// of objects to iterate over for no reason
// Still thread local because of make_unique. But this will disappear soon
static thread_local compaction_backlog_tracker null_backlog_tracker(std::make_unique<null_backlog_tracker>());
compaction_backlog_tracker& get_null_backlog_tracker() {
return null_backlog_tracker;
}
//
// Null compaction strategy is the default compaction strategy.
// As the name implies, it does nothing.
@@ -453,8 +447,8 @@ public:
return compaction_strategy_type::null;
}
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return get_null_backlog_tracker();
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override {
return std::make_unique<null_backlog_tracker>();
}
};
@@ -462,11 +456,14 @@ leveled_compaction_strategy::leveled_compaction_strategy(const std::map<sstring,
: compaction_strategy_impl(options)
, _max_sstable_size_in_mb(calculate_max_sstable_size_in_mb(compaction_strategy_impl::get_value(options, SSTABLE_SIZE_OPTION)))
, _stcs_options(options)
, _backlog_tracker(std::make_unique<leveled_compaction_backlog_tracker>(_max_sstable_size_in_mb, _stcs_options))
{
_compaction_counter.resize(leveled_manifest::MAX_LEVELS);
}
std::unique_ptr<compaction_backlog_tracker::impl> leveled_compaction_strategy::make_backlog_tracker() {
return std::make_unique<leveled_compaction_backlog_tracker>(_max_sstable_size_in_mb, _stcs_options);
}
int32_t
leveled_compaction_strategy::calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const {
using namespace cql3::statements;
@@ -486,7 +483,6 @@ time_window_compaction_strategy::time_window_compaction_strategy(const std::map<
: compaction_strategy_impl(options)
, _options(options)
, _stcs_options(options)
, _backlog_tracker(std::make_unique<time_window_backlog_tracker>(_options, _stcs_options))
{
if (!options.contains(TOMBSTONE_COMPACTION_INTERVAL_OPTION) && !options.contains(TOMBSTONE_THRESHOLD_OPTION)) {
_disable_tombstone_compaction = true;
@@ -497,6 +493,10 @@ time_window_compaction_strategy::time_window_compaction_strategy(const std::map<
_use_clustering_key_filter = true;
}
std::unique_ptr<compaction_backlog_tracker::impl> time_window_compaction_strategy::make_backlog_tracker() {
return std::make_unique<time_window_backlog_tracker>(_options, _stcs_options);
}
} // namespace sstables
std::vector<sstables::shared_sstable>
@@ -640,7 +640,6 @@ namespace sstables {
date_tiered_compaction_strategy::date_tiered_compaction_strategy(const std::map<sstring, sstring>& options)
: compaction_strategy_impl(options)
, _manifest(options)
, _backlog_tracker(std::make_unique<unimplemented_backlog_tracker>())
{
clogger.warn("DateTieredCompactionStrategy is deprecated. Usually cases for which it is used are better handled by TimeWindowCompactionStrategy."
" Please change your compaction strategy to TWCS as DTCS will be retired in the near future");
@@ -670,8 +669,8 @@ compaction_descriptor date_tiered_compaction_strategy::get_sstables_for_compacti
}
// filter out sstables which droppable tombstone ratio isn't greater than the defined threshold.
auto e = boost::range::remove_if(candidates, [this, compaction_time] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time);
auto e = boost::range::remove_if(candidates, [this, compaction_time, &table_s] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s.get_tombstone_gc_state());
});
candidates.erase(e, candidates.end());
if (candidates.empty()) {
@@ -685,17 +684,23 @@ compaction_descriptor date_tiered_compaction_strategy::get_sstables_for_compacti
return sstables::compaction_descriptor({ *it }, service::get_local_compaction_priority());
}
std::unique_ptr<compaction_backlog_tracker::impl> date_tiered_compaction_strategy::make_backlog_tracker() {
return std::make_unique<unimplemented_backlog_tracker>();
}
size_tiered_compaction_strategy::size_tiered_compaction_strategy(const std::map<sstring, sstring>& options)
: compaction_strategy_impl(options)
, _options(options)
, _backlog_tracker(std::make_unique<size_tiered_backlog_tracker>(_options))
{}
size_tiered_compaction_strategy::size_tiered_compaction_strategy(const size_tiered_compaction_strategy_options& options)
: _options(options)
, _backlog_tracker(std::make_unique<size_tiered_backlog_tracker>(_options))
{}
std::unique_ptr<compaction_backlog_tracker::impl> size_tiered_compaction_strategy::make_backlog_tracker() {
return std::make_unique<size_tiered_backlog_tracker>(_options);
}
compaction_strategy::compaction_strategy(::shared_ptr<compaction_strategy_impl> impl)
: _compaction_strategy_impl(std::move(impl)) {}
compaction_strategy::compaction_strategy() = default;
@@ -736,8 +741,8 @@ bool compaction_strategy::use_clustering_key_filter() const {
return _compaction_strategy_impl->use_clustering_key_filter();
}
compaction_backlog_tracker& compaction_strategy::get_backlog_tracker() {
return _compaction_strategy_impl->get_backlog_tracker();
compaction_backlog_tracker compaction_strategy::make_backlog_tracker() {
return compaction_backlog_tracker(_compaction_strategy_impl->make_backlog_tracker());
}
sstables::compaction_descriptor
@@ -745,8 +750,8 @@ compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, iop, mode);
}
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate);
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) {
return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate, std::move(schema));
}
reader_consumer_v2 compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) {

View File

@@ -106,9 +106,9 @@ public:
sstable_set make_sstable_set(schema_ptr schema) const;
compaction_backlog_tracker& get_backlog_tracker();
compaction_backlog_tracker make_backlog_tracker();
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);
uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr);
reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer);

View File

@@ -13,6 +13,7 @@
#include "compaction_strategy.hh"
#include "db_clock.hh"
#include "compaction_descriptor.hh"
#include "tombstone_gc.hh"
namespace compaction {
class table_state;
@@ -21,8 +22,6 @@ class strategy_control;
namespace sstables {
compaction_backlog_tracker& get_unimplemented_backlog_tracker();
class sstable_set_impl;
class resharding_descriptor;
@@ -67,11 +66,11 @@ public:
// Check if a given sstable is entitled for tombstone compaction based on its
// droppable tombstone histogram and gc_before.
bool worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time);
bool worth_dropping_tombstones(const shared_sstable& sst, gc_clock::time_point compaction_time, const tombstone_gc_state& gc_state);
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() = 0;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate);
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema);
virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer);

View File

@@ -259,7 +259,6 @@ namespace sstables {
class date_tiered_compaction_strategy : public compaction_strategy_impl {
date_tiered_manifest _manifest;
compaction_backlog_tracker _backlog_tracker;
public:
date_tiered_compaction_strategy(const std::map<sstring, sstring>& options);
virtual compaction_descriptor get_sstables_for_compaction(table_state& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidates) override;
@@ -272,9 +271,7 @@ public:
return compaction_strategy_type::date_tiered;
}
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
};
}

View File

@@ -39,16 +39,16 @@ compaction_descriptor leveled_compaction_strategy::get_sstables_for_compaction(t
for (auto level = int(manifest.get_level_count()); level >= 0; level--) {
auto& sstables = manifest.get_level(level);
// filter out sstables which droppable tombstone ratio isn't greater than the defined threshold.
auto e = boost::range::remove_if(sstables, [this, compaction_time] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time);
auto e = boost::range::remove_if(sstables, [this, compaction_time, &table_s] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s.get_tombstone_gc_state());
});
sstables.erase(e, sstables.end());
if (sstables.empty()) {
continue;
}
auto& sst = *std::max_element(sstables.begin(), sstables.end(), [&] (auto& i, auto& j) {
auto gc_before1 = i->get_gc_before_for_drop_estimation(compaction_time);
auto gc_before2 = j->get_gc_before_for_drop_estimation(compaction_time);
auto gc_before1 = i->get_gc_before_for_drop_estimation(compaction_time, table_s.get_tombstone_gc_state());
auto gc_before2 = j->get_gc_before_for_drop_estimation(compaction_time, table_s.get_tombstone_gc_state());
return i->estimate_droppable_tombstone_ratio(gc_before1) < j->estimate_droppable_tombstone_ratio(gc_before2);
});
return sstables::compaction_descriptor({ sst }, service::get_local_compaction_priority(), sst->get_sstable_level());
@@ -144,6 +144,8 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
auto max_sstable_size_in_bytes = _max_sstable_size_in_mb * 1024 * 1024;
leveled_manifest::logger.debug("get_reshaping_job: mode={} input.size={} max_sstable_size_in_bytes={}", mode == reshape_mode::relaxed ? "relaxed" : "strict", input.size(), max_sstable_size_in_bytes);
for (auto& sst : input) {
auto sst_level = sst->get_sstable_level();
if (sst_level > leveled_manifest::MAX_LEVELS - 1) {
@@ -200,10 +202,8 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
auto [disjoint, overlapping_sstables] = is_disjoint(level_info[level], tolerance(level));
if (!disjoint) {
auto ideal_level = ideal_level_for_input(input, max_sstable_size_in_bytes);
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so compacting everything on behalf of {}.{}", level, overlapping_sstables, schema->ks_name(), schema->cf_name());
// Unfortunately no good limit to limit input size to max_sstables for LCS major
compaction_descriptor desc(std::move(input), iop, ideal_level, max_sstable_size_in_bytes);
leveled_manifest::logger.warn("Turns out that level {} is not disjoint, found {} overlapping SSTables, so the level will be entirely compacted on behalf of {}.{}", level, overlapping_sstables, schema->ks_name(), schema->cf_name());
compaction_descriptor desc(std::move(level_info[level]), iop, level, max_sstable_size_in_bytes);
desc.options = compaction_type_options::make_reshape();
return desc;
}
@@ -229,6 +229,9 @@ leveled_compaction_strategy::get_cleanup_compaction_jobs(table_state& table_s, s
}
unsigned leveled_compaction_strategy::ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size) {
if (!max_sstable_size) {
return 1;
}
auto log_fanout = [fanout = leveled_manifest::leveled_fan_out] (double x) {
double inv_log_fanout = 1.0f / std::log(fanout);
return log(x) * inv_log_fanout;

View File

@@ -35,7 +35,6 @@ class leveled_compaction_strategy : public compaction_strategy_impl {
std::optional<std::vector<std::optional<dht::decorated_key>>> _last_compacted_keys;
std::vector<int> _compaction_counter;
size_tiered_compaction_strategy_options _stcs_options;
compaction_backlog_tracker _backlog_tracker;
int32_t calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const;
public:
static unsigned ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size);
@@ -64,9 +63,7 @@ public:
}
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, const ::io_priority_class& iop, reshape_mode mode) override;
};

View File

@@ -6,6 +6,7 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "sstables/sstables.hh"
#include "size_tiered_compaction_strategy.hh"
#include <boost/range/adaptor/transformed.hpp>
@@ -169,8 +170,8 @@ size_tiered_compaction_strategy::get_sstables_for_compaction(table_state& table_
// tombstone purge, i.e. less likely to shadow even older data.
for (auto&& sstables : buckets | boost::adaptors::reversed) {
// filter out sstables which droppable tombstone ratio isn't greater than the defined threshold.
auto e = boost::range::remove_if(sstables, [this, compaction_time] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time);
auto e = boost::range::remove_if(sstables, [this, compaction_time, &table_s] (const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s.get_tombstone_gc_state());
});
sstables.erase(e, sstables.end());
if (sstables.empty()) {

View File

@@ -10,7 +10,7 @@
#include "compaction_strategy_impl.hh"
#include "compaction.hh"
#include "sstables/sstables.hh"
#include "sstables/shared_sstable.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
class size_tiered_backlog_tracker;
@@ -82,7 +82,6 @@ public:
class size_tiered_compaction_strategy : public compaction_strategy_impl {
size_tiered_compaction_strategy_options _options;
compaction_backlog_tracker _backlog_tracker;
// Return a list of pair of shared_sstable and its respective size.
static std::vector<std::pair<sstables::shared_sstable, uint64_t>> create_sstable_and_length_pairs(const std::vector<sstables::shared_sstable>& sstables);
@@ -128,9 +127,7 @@ public:
most_interesting_bucket(const std::vector<sstables::shared_sstable>& candidates, int min_threshold, int max_threshold,
size_tiered_compaction_strategy_options options = {});
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, const ::io_priority_class& iop, reshape_mode mode) override;

View File

@@ -10,14 +10,15 @@
#pragma once
#include "schema_fwd.hh"
#include "sstables/sstable_set.hh"
#include "sstables/sstables_manager.hh"
#include "compaction_descriptor.hh"
class reader_permit;
class compaction_backlog_tracker;
namespace sstables {
class sstable_set;
class compaction_strategy;
class sstables_manager;
struct sstable_writer_config;
}
@@ -40,9 +41,10 @@ public:
virtual sstables::shared_sstable make_sstable() const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual future<> update_compaction_history(utils::UUID compaction_id, sstring ks_name, sstring cf_name, std::chrono::milliseconds ended_at, int64_t bytes_in, int64_t bytes_out) = 0;
virtual future<> on_compaction_completion(sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
virtual const tombstone_gc_state& get_tombstone_gc_state() const noexcept = 0;
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
};
}

View File

@@ -43,6 +43,10 @@ time_window_compaction_strategy_options::time_window_compaction_strategy_options
}
}
if (window_size <= 0) {
throw exceptions::configuration_exception(fmt::format("{} must be greater than 1 for compaction_window_size", window_size));
}
sstable_window_size = window_size * window_unit;
it = options.find(EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY);
@@ -96,16 +100,27 @@ public:
};
};
uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) {
if (!ms_meta.min_timestamp || !ms_meta.max_timestamp) {
// Not enough information, we assume the worst
return partition_estimate / max_data_segregation_window_count;
}
const auto min_window = get_window_for(_options, *ms_meta.min_timestamp);
const auto max_window = get_window_for(_options, *ms_meta.max_timestamp);
const auto window_size = get_window_size(_options);
uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr s) {
// If not enough information, we assume the worst
auto estimated_window_count = max_data_segregation_window_count;
auto default_ttl = std::chrono::duration_cast<std::chrono::microseconds>(s->default_time_to_live());
bool min_and_max_ts_available = ms_meta.min_timestamp && ms_meta.max_timestamp;
auto estimate_window_count = [this] (timestamp_type min_window, timestamp_type max_window) {
const auto window_size = get_window_size(_options);
return (max_window + (window_size - 1) - min_window) / window_size;
};
auto estimated_window_count = (max_window + (window_size - 1) - min_window) / window_size;
if (!min_and_max_ts_available && default_ttl.count()) {
auto min_window = get_window_for(_options, timestamp_type(0));
auto max_window = get_window_for(_options, timestamp_type(default_ttl.count()));
estimated_window_count = estimate_window_count(min_window, max_window);
} else if (min_and_max_ts_available) {
auto min_window = get_window_for(_options, *ms_meta.min_timestamp);
auto max_window = get_window_for(_options, *ms_meta.max_timestamp);
estimated_window_count = estimate_window_count(min_window, max_window);
}
return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));
}
@@ -270,8 +285,8 @@ time_window_compaction_strategy::get_next_non_expired_sstables(table_state& tabl
// if there is no sstable to compact in standard way, try compacting single sstable whose droppable tombstone
// ratio is greater than threshold.
auto e = boost::range::remove_if(non_expiring_sstables, [this, compaction_time] (const shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time);
auto e = boost::range::remove_if(non_expiring_sstables, [this, compaction_time, &table_s] (const shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s.get_tombstone_gc_state());
});
non_expiring_sstables.erase(e, non_expiring_sstables.end());
if (non_expiring_sstables.empty()) {

View File

@@ -15,7 +15,7 @@
#include "size_tiered_compaction_strategy.hh"
#include "timestamp.hh"
#include "exceptions/exceptions.hh"
#include "sstables/sstables.hh"
#include "sstables/shared_sstable.hh"
#include "service/priority_manager.hh"
namespace sstables {
@@ -73,7 +73,6 @@ class time_window_compaction_strategy : public compaction_strategy_impl {
// Keep track of all recent active windows that still need to be compacted into a single SSTable
std::unordered_set<timestamp_type> _recent_active_windows;
size_tiered_compaction_strategy_options _stcs_options;
compaction_backlog_tracker _backlog_tracker;
public:
// The maximum amount of buckets we segregate data into when writing into sstables.
// To prevent an explosion in the number of sstables we cap it.
@@ -156,11 +155,9 @@ public:
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual compaction_backlog_tracker& get_backlog_tracker() override {
return _backlog_tracker;
}
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() override;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate) override;
virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr s) override;
virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) override;

View File

@@ -10,88 +10,6 @@
#pragma once
#include "dht/i_partitioner.hh"
#include <optional>
#include <variant>
// Wraps ring_position_view so it is compatible with old-style C++: default
// constructor, stateless comparators, yada yada.
class compatible_ring_position_view {
const ::schema* _schema = nullptr;
// Optional to supply a default constructor, no more.
std::optional<dht::ring_position_view> _rpv;
public:
constexpr compatible_ring_position_view() = default;
compatible_ring_position_view(const schema& s, dht::ring_position_view rpv)
: _schema(&s), _rpv(rpv) {
}
const dht::ring_position_view& position() const {
return *_rpv;
}
const ::schema& schema() const {
return *_schema;
}
friend std::strong_ordering tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rpv, *y._rpv);
}
friend bool operator<(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) < 0;
}
friend bool operator<=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) <= 0;
}
friend bool operator>(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) > 0;
}
friend bool operator>=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) >= 0;
}
friend bool operator==(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) == 0;
}
friend bool operator!=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) != 0;
}
};
// Wraps ring_position so it is compatible with old-style C++: default
// constructor, stateless comparators, yada yada.
class compatible_ring_position {
schema_ptr _schema;
// Optional to supply a default constructor, no more.
std::optional<dht::ring_position> _rp;
public:
constexpr compatible_ring_position() = default;
compatible_ring_position(schema_ptr s, dht::ring_position rp)
: _schema(std::move(s)), _rp(std::move(rp)) {
}
dht::ring_position_view position() const {
return *_rp;
}
const ::schema& schema() const {
return *_schema;
}
friend std::strong_ordering tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rp, *y._rp);
}
friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) < 0;
}
friend bool operator<=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) <= 0;
}
friend bool operator>(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) > 0;
}
friend bool operator>=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) >= 0;
}
friend bool operator==(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) == 0;
}
friend bool operator!=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) != 0;
}
};
// Wraps ring_position or ring_position_view so either is compatible with old-style C++: default
// constructor, stateless comparators, yada yada.
@@ -99,37 +17,22 @@ public:
// on callers to keep ring position alive, allow lookup on containers that don't support different
// key types, and also avoiding unnecessary copies.
class compatible_ring_position_or_view {
// Optional to supply a default constructor, no more.
std::optional<std::variant<compatible_ring_position, compatible_ring_position_view>> _crp_or_view;
schema_ptr _schema;
lw_shared_ptr<dht::ring_position> _rp;
dht::ring_position_view_opt _rpv; // optional only for default ctor, nothing more
public:
constexpr compatible_ring_position_or_view() = default;
compatible_ring_position_or_view() = default;
explicit compatible_ring_position_or_view(schema_ptr s, dht::ring_position rp)
: _crp_or_view(compatible_ring_position(std::move(s), std::move(rp))) {
: _schema(std::move(s)), _rp(make_lw_shared<dht::ring_position>(std::move(rp))), _rpv(dht::ring_position_view(*_rp)) {
}
explicit compatible_ring_position_or_view(const schema& s, dht::ring_position_view rpv)
: _crp_or_view(compatible_ring_position_view(s, rpv)) {
: _schema(s.shared_from_this()), _rpv(rpv) {
}
dht::ring_position_view position() const {
struct rpv_accessor {
dht::ring_position_view operator()(const compatible_ring_position& crp) {
return crp.position();
}
dht::ring_position_view operator()(const compatible_ring_position_view& crpv) {
return crpv.position();
}
};
return std::visit(rpv_accessor{}, *_crp_or_view);
const dht::ring_position_view& position() const {
return *_rpv;
}
friend std::strong_ordering tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
struct schema_accessor {
const ::schema& operator()(const compatible_ring_position& crp) {
return crp.schema();
}
const ::schema& operator()(const compatible_ring_position_view& crpv) {
return crpv.schema();
}
};
return dht::ring_position_tri_compare(std::visit(schema_accessor{}, *x._crp_or_view), x.position(), y.position());
return dht::ring_position_tri_compare(*x._schema, x.position(), y.position());
}
friend bool operator<(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
return tri_compare(x, y) < 0;

View File

@@ -16,7 +16,6 @@
#include <boost/range/adaptor/transformed.hpp>
#include "utils/serialization.hh"
#include <seastar/util/backtrace.hh>
#include "cql_serialization_format.hh"
enum class allow_prefixes { no, yes };
@@ -280,7 +279,7 @@ public:
}
for (size_t i = 0; i != values.size(); ++i) {
//FIXME: is it safe to assume internal serialization-format format?
_types[i]->validate(values[i], cql_serialization_format::internal());
_types[i]->validate(values[i]);
}
}
bool equal(managed_bytes_view v1, managed_bytes_view v2) const {

View File

@@ -560,7 +560,7 @@ public:
auto marker = it->second;
++it;
if (it != e && marker != composite::eoc::none) {
throw runtime_exception(format("non-zero component divider found ({:d}) mid", format("0x{:02x}", composite::eoc_type(marker) & 0xff)));
throw runtime_exception(format("non-zero component divider found ({:#02x}) mid", composite::eoc_type(marker) & 0xff));
}
}
return ret;

View File

@@ -117,6 +117,8 @@ struct date_type_impl final : public concrete_type<db_clock::time_point> {
using timestamp_date_base_class = concrete_type<db_clock::time_point>;
sstring timestamp_to_json_string(const timestamp_date_base_class& t, const bytes_view& bv);
struct timeuuid_type_impl final : public concrete_type<utils::UUID> {
timeuuid_type_impl();
static utils::UUID from_sstring(sstring_view s);

View File

@@ -65,6 +65,13 @@ commitlog_sync_period_in_ms: 10000
# is reasonable.
commitlog_segment_size_in_mb: 32
# The size of the individual schema commitlog file segments.
# The segment size puts a limit on the mutation size that can be
# written at once, and some schema mutation writes are much larger
# than average.
schema_commitlog_segment_size_in_mb: 32
# seed_provider class_name is saved for future use.
# A seed address is mandatory.
seed_provider:
@@ -410,6 +417,9 @@ commitlog_total_space_in_mb: -1
# Log a warning when row number is larger than this value
# compaction_rows_count_warning_threshold: 100000
# Log a warning when writing a collection containing more elements than this value
# compaction_collection_elements_count_warning_threshold: 10000
# How long the coordinator should wait for seq or index scans to complete
# range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
@@ -445,20 +455,20 @@ commitlog_total_space_in_mb: -1
# internode_encryption: none
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# truststore: <none, use system trust>
# certficate_revocation_list: <none>
# truststore: <not set, use system trust>
# certficate_revocation_list: <not set>
# require_client_auth: False
# priority_string: <none, use default>
# priority_string: <not set, use default>
# enable or disable client/server encryption.
# client_encryption_options:
# enabled: false
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# truststore: <none, use system trust>
# certficate_revocation_list: <none>
# truststore: <not set, use system trust>
# certficate_revocation_list: <not set>
# require_client_auth: False
# priority_string: <none, use default>
# priority_string: <not set, use default>
# internode_compression controls whether traffic between nodes is
# compressed.
@@ -550,4 +560,16 @@ murmur3_partitioner_ignore_msb_bits: 12
# WARNING: It's unsafe to set this to false if the node previously booted
# with the schema commit log enabled. In such case, some schema changes
# may be lost if the node was not cleanly stopped.
force_schema_commit_log: true
force_schema_commit_log: true
# Use Raft to consistently manage schema information in the cluster.
# Refer to https://docs.scylladb.com/master/architecture/raft.html for more details.
# The 'Handling Failures' section is especially important.
#
# Once enabled in a cluster, this cannot be turned off.
# If you want to bootstrap a new cluster without Raft, make sure to set this to `false`
# before starting your nodes for the first time.
#
# A cluster not using Raft can be 'upgraded' to use Raft. Refer to the aforementioned
# documentation, section 'Enabling Raft in ScyllaDB 5.2 and further', for the procedure.
consistent_cluster_management: true

View File

@@ -48,28 +48,10 @@ employ_ld_trickery = True
# distro-specific setup
def distro_setup_nix():
global os_ids, employ_ld_trickery
global distro_extra_ldflags, distro_extra_cflags, distro_extra_cmake_args
os_ids = ['linux']
employ_ld_trickery = False
libdirs = list(dict.fromkeys(os.environ.get('CMAKE_LIBRARY_PATH').split(':')))
incdirs = list(dict.fromkeys(os.environ.get('CMAKE_INCLUDE_PATH').split(':')))
# add nix {lib,inc}dirs to relevant flags, mimicing nix versions of cmake & autotools.
# also add rpaths to make sure that any built executables can run in place.
distro_extra_ldflags = ' '.join([ '-rpath ' + path + ' -L' + path for path in libdirs ]);
distro_extra_cflags = ' '.join([ '-isystem ' + path for path in incdirs ])
# indexers like clangd may or may not know which stdc++ or glibc
# the compiler was configured with, so make the relevant paths
# explicit on each compilation command line:
implicit_cflags = os.environ.get('IMPLICIT_CFLAGS').strip()
distro_extra_cflags += ' ' + implicit_cflags
# also propagate to cmake-built dependencies:
distro_extra_cmake_args = ['-DCMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES:INTERNAL=' + implicit_cflags]
if os.environ.get('NIX_BUILD_TOP'):
if os.environ.get('NIX_CC'):
distro_setup_nix()
# distribution "internationalization", converting package names.
@@ -214,7 +196,7 @@ def linker_flags(compiler):
def maybe_static(flag, libs):
if flag and not args.static:
if flag:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
return libs
@@ -303,7 +285,8 @@ modes = {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*40,
'optimization-level': 'g',
# -fasan -Og breaks some coroutines on aarch64, use -O0 instead
'optimization-level': ('0' if platform.machine() == 'aarch64' else 'g'),
'per_src_extra_cxxflags': {},
'cmake_build_type': 'Debug',
'can_have_debug_info': True,
@@ -426,10 +409,12 @@ scylla_tests = set([
'test/boost/limiting_data_source_test',
'test/boost/linearizing_input_stream_test',
'test/boost/loading_cache_test',
'test/boost/locator_topology_test',
'test/boost/log_heap_test',
'test/boost/estimated_histogram_test',
'test/boost/summary_test',
'test/boost/logalloc_test',
'test/boost/logalloc_standard_allocator_segment_pool_backend_test',
'test/boost/managed_vector_test',
'test/boost/managed_bytes_test',
'test/boost/intrusive_array_test',
@@ -461,7 +446,7 @@ scylla_tests = set([
'test/boost/schema_change_test',
'test/boost/schema_registry_test',
'test/boost/secondary_index_test',
'test/boost/tracing',
'test/boost/tracing_test',
'test/boost/index_with_paging_test',
'test/boost/serialization_test',
'test/boost/serialized_action_test',
@@ -495,6 +480,8 @@ scylla_tests = set([
'test/boost/virtual_reader_test',
'test/boost/virtual_table_mutation_source_test',
'test/boost/virtual_table_test',
'test/boost/wasm_test',
'test/boost/wasm_alloc_test',
'test/boost/bptree_test',
'test/boost/btree_test',
'test/boost/radix_tree_test',
@@ -562,6 +549,7 @@ raft_tests = set([
'test/raft/replication_test',
'test/raft/randomized_nemesis_test',
'test/raft/many_test',
'test/raft/raft_server_test',
'test/raft/fsm_test',
'test/raft/etcd_test',
'test/raft/raft_sys_table_storage_test',
@@ -585,13 +573,6 @@ all_artifacts = apps | tests | other
arg_parser = argparse.ArgumentParser('Configure scylla')
arg_parser.add_argument('--out', dest='buildfile', action='store', default='build.ninja',
help='Output build-file name (by default build.ninja)')
arg_parser.add_argument('--static', dest='static', action='store_const', default='',
const='-static',
help='Static link (useful for running on hosts outside the build environment')
arg_parser.add_argument('--pie', dest='pie', action='store_true',
help='Build position-independent executable (PIE)')
arg_parser.add_argument('--so', dest='so', action='store_true',
help='Build shared object (SO) instead of executable')
arg_parser.add_argument('--mode', action='append', choices=list(modes.keys()), dest='selected_modes',
help="Build modes to generate ninja files for. The available build modes are:\n{}".format("; ".join(["{} - {}".format(m, cfg['description']) for m, cfg in modes.items()])))
arg_parser.add_argument('--with', dest='artifacts', action='append', default=[],
@@ -630,6 +611,8 @@ arg_parser.add_argument('--static-yaml-cpp', dest='staticyamlcpp', action='store
help='Link libyaml-cpp statically')
arg_parser.add_argument('--tests-debuginfo', action='store', dest='tests_debuginfo', type=int, default=0,
help='Enable(1)/disable(0)compiler debug information generation for tests')
arg_parser.add_argument('--perf-tests-debuginfo', action='store', dest='perf_tests_debuginfo', type=int, default=0,
help='Enable(1)/disable(0)compiler debug information generation for perf tests')
arg_parser.add_argument('--python', action='store', dest='python', default='python3',
help='Python3 path')
arg_parser.add_argument('--split-dwarf', dest='split_dwarf', action='store_true', default=False,
@@ -652,6 +635,8 @@ arg_parser.add_argument('--clang-inline-threshold', action='store', type=int, de
help="LLVM-specific inline threshold compilation parameter")
arg_parser.add_argument('--list-artifacts', dest='list_artifacts', action='store_true', default=False,
help='List all available build artifacts, that can be passed to --with')
arg_parser.add_argument('--date-stamp', dest='date_stamp', type=str,
help='Set datestamp for SCYLLA-VERSION-GEN')
args = arg_parser.parse_args()
if args.list_artifacts:
@@ -661,6 +646,7 @@ if args.list_artifacts:
defines = ['XXH_PRIVATE_API',
'SEASTAR_TESTING_MAIN',
'FMT_DEPRECATED_OSTREAM',
]
scylla_raft_core = [
@@ -671,12 +657,13 @@ scylla_raft_core = [
'raft/log.cc',
]
scylla_core = (['replica/database.cc',
scylla_core = (['message/messaging_service.cc',
'replica/database.cc',
'replica/table.cc',
'replica/distributed_loader.cc',
'replica/memtable.cc',
'replica/exceptions.cc',
'dirty_memory_manager.cc',
'replica/dirty_memory_manager.cc',
'absl-flat_hash_map.cc',
'atomic_cell.cc',
'caching_options.cc',
@@ -711,6 +698,7 @@ scylla_core = (['replica/database.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'utils/on_internal_error.cc',
'converting_mutation_partition_applier.cc',
'readers/combined.cc',
'readers/multishard.cc',
@@ -799,6 +787,8 @@ scylla_core = (['replica/database.cc',
'cql3/statements/raw/parsed_statement.cc',
'cql3/statements/property_definitions.cc',
'cql3/statements/update_statement.cc',
'cql3/statements/strongly_consistent_modification_statement.cc',
'cql3/statements/strongly_consistent_select_statement.cc',
'cql3/statements/delete_statement.cc',
'cql3/statements/prune_materialized_view_statement.cc',
'cql3/statements/batch_statement.cc',
@@ -828,6 +818,7 @@ scylla_core = (['replica/database.cc',
'cql3/statements/detach_service_level_statement.cc',
'cql3/statements/list_service_level_statement.cc',
'cql3/statements/list_service_level_attachments_statement.cc',
'cql3/statements/describe_statement.cc',
'cql3/update_parameters.cc',
'cql3/util.cc',
'cql3/ut_name.cc',
@@ -913,6 +904,7 @@ scylla_core = (['replica/database.cc',
'utils/config_file.cc',
'utils/multiprecision_int.cc',
'utils/gz/crc_combine.cc',
'utils/gz/crc_combine_table.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -947,7 +939,8 @@ scylla_core = (['replica/database.cc',
'locator/ec2_snitch.cc',
'locator/ec2_multi_region_snitch.cc',
'locator/gce_snitch.cc',
'message/messaging_service.cc',
'locator/topology.cc',
'locator/util.cc',
'service/client_state.cc',
'service/storage_service.cc',
'service/misc_services.cc',
@@ -977,6 +970,7 @@ scylla_core = (['replica/database.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -999,7 +993,6 @@ scylla_core = (['replica/database.cc',
'tracing/tracing.cc',
'tracing/trace_keyspace_helper.cc',
'tracing/trace_state.cc',
'tracing/tracing_backend_registry.cc',
'tracing/traced_file.cc',
'table_helper.cc',
'range_tombstone.cc',
@@ -1037,6 +1030,9 @@ scylla_core = (['replica/database.cc',
'service/raft/raft_group0.cc',
'direct_failure_detector/failure_detector.cc',
'service/raft/raft_group0_client.cc',
'service/broadcast_tables/experimental/lang.cc',
'tasks/task_manager.cc',
'rust/wasmtime_bindings/src/lib.rs',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')] \
+ scylla_raft_core
)
@@ -1073,12 +1069,18 @@ api = ['api/api.cc',
'api/stream_manager.cc',
Json2Code('api/api-doc/system.json'),
'api/system.cc',
Json2Code('api/api-doc/task_manager.json'),
'api/task_manager.cc',
Json2Code('api/api-doc/task_manager_test.json'),
'api/task_manager_test.cc',
'api/config.cc',
Json2Code('api/api-doc/config.json'),
'api/error_injection.cc',
Json2Code('api/api-doc/error_injection.json'),
'api/authorization_cache.cc',
Json2Code('api/api-doc/authorization_cache.json'),
'api/raft.cc',
Json2Code('api/api-doc/raft.json'),
]
alternator = [
@@ -1147,12 +1149,9 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/replica_exception.idl.hh',
'idl/per_partition_rate_limit_info.idl.hh',
'idl/position_in_partition.idl.hh',
'idl/experimental/broadcast_tables_lang.idl.hh',
]
rusts = [
'rust/inc/src/lib.rs',
]
headers = find_headers('.', excluded_dirs=['idl', 'build', 'seastar', '.git'])
scylla_tests_generic_dependencies = [
@@ -1174,9 +1173,9 @@ scylla_tests_dependencies = scylla_core + idls + scylla_tests_generic_dependenci
'test/lib/random_schema.cc',
]
scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc']
scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc', 'utils/error_injection.cc']
scylla_tools = ['tools/scylla-types.cc', 'tools/scylla-sstable.cc', 'tools/schema_loader.cc', 'tools/utils.cc']
scylla_tools = ['tools/scylla-types.cc', 'tools/scylla-sstable.cc', 'tools/schema_loader.cc', 'tools/utils.cc', 'tools/lua_sstable_consumer.cc']
deps = {
'scylla': idls + ['main.cc'] + scylla_core + api + alternator + redis + scylla_tools,
@@ -1274,7 +1273,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'hashers.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']
@@ -1298,16 +1297,17 @@ deps['test/boost/linearizing_input_stream_test'] = [
"test/boost/linearizing_input_stream_test.cc",
"test/lib/log.cc",
]
deps['test/boost/expr_test'] = ['test/boost/expr_test.cc'] + scylla_core
deps['test/boost/expr_test'] = ['test/boost/expr_test.cc', 'test/lib/expr_test_utils.cc'] + scylla_core
deps['test/boost/rate_limiter_test'] = ['test/boost/rate_limiter_test.cc', 'db/rate_limiter.cc']
deps['test/boost/exceptions_optimized_test'] = ['test/boost/exceptions_optimized_test.cc', 'utils/exceptions.cc']
deps['test/boost/exceptions_fallback_test'] = ['test/boost/exceptions_fallback_test.cc', 'utils/exceptions.cc']
deps['test/boost/duration_test'] += ['test/lib/exception_utils.cc']
deps['test/boost/schema_loader_test'] += ['tools/schema_loader.cc']
deps['test/boost/rust_test'] += rusts
deps['test/boost/rust_test'] += ['rust/inc/src/lib.rs']
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/raft_server_test'] = ['test/raft/raft_server_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/randomized_nemesis_test'] = ['test/raft/randomized_nemesis_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/failure_detector_test'] = ['test/raft/failure_detector_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/many_test'] = ['test/raft/many_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
@@ -1321,8 +1321,6 @@ deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
'test/lib/log.cc',
'service/raft/discovery.cc'] + scylla_raft_dependencies
deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']
warnings = [
'-Wall',
@@ -1372,7 +1370,7 @@ warnings = [w
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
def clang_inline_threshold():
def get_clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
@@ -1393,7 +1391,7 @@ for mode in modes:
optimization_flags = [
'--param inline-unit-growth=300', # gcc
f'-mllvm -inline-threshold={clang_inline_threshold()}', # clang
f'-mllvm -inline-threshold={get_clang_inline_threshold()}', # clang
# clang generates 16-byte loads that break store-to-load forwarding
# gcc also has some trouble: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554
'-fno-slp-vectorize',
@@ -1407,37 +1405,16 @@ if flag_supported(flag='-Wstack-usage=4096', compiler=args.cxx):
for mode in modes:
modes[mode]['cxxflags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='
has_wasmtime = os.path.isfile('/usr/lib64/libwasmtime.a') and os.path.isdir('/usr/local/include/wasmtime')
if has_wasmtime:
if platform.machine() == 'aarch64':
print("wasmtime is temporarily not supported on aarch64. Ref: issue #9387")
has_wasmtime = False
else:
for mode in modes:
modes[mode]['cxxflags'] += ' -DSCYLLA_ENABLE_WASMTIME'
else:
print("wasmtime not found - WASM support will not be enabled in this build")
linker_flags = linker_flags(compiler=args.cxx)
dbgflag = '-g -gz' if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
perf_tests_link_rule = 'link' if args.perf_tests_debuginfo else 'link_stripped'
# Strip if debuginfo is disabled, otherwise we end up with partial
# debug info from the libraries we static link with
regular_link_rule = 'link' if args.debuginfo else 'link_stripped'
if args.so:
args.pie = '-shared'
args.fpie = '-fpic'
elif args.pie:
args.pie = '-pie'
args.fpie = '-fpie'
else:
args.pie = ''
args.fpie = ''
# a list element means a list of alternative packages to consider
# the first element becomes the HAVE_pkg define
# a string element is a package name with no alternatives
@@ -1536,7 +1513,8 @@ if args.artifacts:
else:
build_artifacts = all_artifacts
status = subprocess.call("./SCYLLA-VERSION-GEN")
date_stamp = f"--date-stamp {args.date_stamp}" if args.date_stamp else ""
status = subprocess.call(f"./SCYLLA-VERSION-GEN {date_stamp}", shell=True)
if status != 0:
print('Version file generation failed')
sys.exit(1)
@@ -1551,7 +1529,8 @@ scylla_product = file.read().strip()
arch = platform.machine()
for m, mode_config in modes.items():
cxxflags = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\" -DSCYLLA_BUILD_MODE=\"\\\"" + m + "\\\"\""
mode_config['cxxflags'] += f" -DSCYLLA_BUILD_MODE={m}"
cxxflags = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""
mode_config["per_src_extra_cxxflags"]["release.cc"] = cxxflags
if mode_config["can_have_debug_info"]:
mode_config['cxxflags'] += ' ' + dbgflag
@@ -1592,13 +1571,14 @@ args.user_ldflags = forced_ldflags + ' ' + args.user_ldflags
args.user_cflags += f" -ffile-prefix-map={curdir}=."
seastar_cflags = args.user_cflags
if args.target != '':
seastar_cflags += ' -march=' + args.target
seastar_ldflags = args.user_ldflags
args.user_cflags += ' -march=' + args.target
libdeflate_cflags = seastar_cflags
for mode in modes:
# Those flags are passed not only to Scylla objects, but also to libraries
# that we compile ourselves.
modes[mode]['lib_cflags'] = args.user_cflags
modes[mode]['lib_ldflags'] = args.user_ldflags + linker_flags
# cmake likes to separate things with semicolons
def semicolon_separated(*flags):
@@ -1618,8 +1598,8 @@ def configure_seastar(build_dir, mode, mode_config):
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON',
'-DSeastar_CXX_FLAGS={}'.format((seastar_cflags).replace(' ', ';')),
'-DSeastar_LD_FLAGS={}'.format(semicolon_separated(seastar_ldflags, modes[mode]['cxx_ld_flags'])),
'-DSeastar_CXX_FLAGS=SHELL:{}'.format(mode_config['lib_cflags']),
'-DSeastar_LD_FLAGS={}'.format(semicolon_separated(mode_config['lib_ldflags'], mode_config['cxx_ld_flags'])),
'-DSeastar_CXX_DIALECT=gnu++20',
'-DSeastar_API_LEVEL=6',
'-DSeastar_UNUSED_RESULT_ERROR=ON',
@@ -1680,52 +1660,16 @@ for mode in build_modes:
seastar_pc_cflags, seastar_pc_libs = query_seastar_flags(pc[mode], link_static_cxx=args.staticcxx)
modes[mode]['seastar_cflags'] = seastar_pc_cflags
modes[mode]['seastar_libs'] = seastar_pc_libs
modes[mode]['seastar_testing_libs'] = pkg_config(pc[mode].replace('seastar.pc', 'seastar-testing.pc'), '--libs', '--static')
def configure_abseil(build_dir, mode, mode_config):
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
abseil_pkgs = [
'absl_raw_hash_set',
'absl_hash',
]
abseil_cflags = seastar_cflags + ' ' + modes[mode]['cxx_ld_flags']
cmake_mode = mode_config['cmake_build_type']
abseil_cmake_args = [
'-DCMAKE_BUILD_TYPE={}'.format(cmake_mode),
'-DCMAKE_INSTALL_PREFIX={}'.format(build_dir + '/inst'), # just to avoid a warning from absl
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_CXX_FLAGS_{}={}'.format(cmake_mode.upper(), abseil_cflags),
'-DCMAKE_EXPORT_COMPILE_COMMANDS=ON',
'-DCMAKE_CXX_STANDARD=20',
'-DABSL_PROPAGATE_CXX_STD=ON',
] + distro_extra_cmake_args
abseil_cmd = ['cmake', '-G', 'Ninja', real_relpath('abseil', abseil_build_dir)] + abseil_cmake_args
os.makedirs(abseil_build_dir, exist_ok=True)
subprocess.check_call(abseil_cmd, shell=False, cwd=abseil_build_dir)
abseil_libs = ['absl/' + lib for lib in [
'container/libabsl_hashtablez_sampler.a',
'container/libabsl_raw_hash_set.a',
'synchronization/libabsl_synchronization.a',
'synchronization/libabsl_graphcycles_internal.a',
'debugging/libabsl_stacktrace.a',
'debugging/libabsl_symbolize.a',
'debugging/libabsl_debugging_internal.a',
'debugging/libabsl_demangle_internal.a',
'time/libabsl_time.a',
'time/libabsl_time_zone.a',
'numeric/libabsl_int128.a',
'hash/libabsl_city.a',
'hash/libabsl_hash.a',
'hash/libabsl_low_level_hash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',
'base/libabsl_raw_logging_internal.a',
'profiling/libabsl_exponential_biased.a',
'base/libabsl_throw_delegate.a']]
pkgs += abseil_pkgs
args.user_cflags += " " + pkg_config('jsoncpp', '--cflags')
args.user_cflags += ' -march=' + args.target
libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-llz4', '-lz', '-lsnappy', pkg_config('jsoncpp', '--libs'),
' -lstdc++fs', ' -lcrypt', ' -lcryptopp', ' -lpthread',
# Must link with static version of libzstd, since
@@ -1733,9 +1677,8 @@ libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-l
maybe_static(True, '-lzstd'),
maybe_static(args.staticboost, '-lboost_date_time -lboost_regex -licuuc -licui18n'),
'-lxxhash',
'-ldeflate',
])
if has_wasmtime:
print("Found wasmtime dependency, linking with libwasmtime")
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
@@ -1754,7 +1697,6 @@ if any(filter(thrift_version.startswith, thrift_boost_versions)):
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config(pkg, '--cflags')
libs += ' ' + pkg_config(pkg, '--libs')
args.user_cflags += ' -Iabseil'
user_cflags = args.user_cflags + ' -fvisibility=hidden'
user_ldflags = args.user_ldflags + ' -fvisibility=hidden'
if args.staticcxx:
@@ -1776,10 +1718,6 @@ if args.ragel_exec:
else:
ragel_exec = "ragel"
if not args.dist_only:
for mode, mode_config in build_modes.items():
configure_abseil(outdir, mode, mode_config)
with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
configure_args = {configure_args}
@@ -1817,8 +1755,14 @@ with open(buildfile, 'w') as f:
rule copy
command = cp --reflink=auto $in $out
description = COPY $out
rule strip
command = scripts/strip.sh $in
rule package
command = scripts/create-relocatable-package.py --mode $mode $out
rule stripped_package
command = scripts/create-relocatable-package.py --stripped --mode $mode $out
rule debuginfo_package
command = dist/debuginfo/scripts/create-relocatable-package.py --mode $mode $out
rule rpmbuild
command = reloc/build_rpm.sh --reloc-pkg $in --builddir $out
rule debbuild
@@ -1826,18 +1770,24 @@ with open(buildfile, 'w') as f:
rule unified
command = unified/build_unified.sh --mode $mode --unified-pkg $out
rule rust_header
command = cxxbridge $in > $out
command = cxxbridge --include rust/cxx.h --header $in > $out
description = RUST_HEADER $out
rule rust_source
command = cxxbridge --include rust/cxx.h $in > $out
description = RUST_SOURCE $out
rule cxxbridge_header
command = cxxbridge --header > $out
''').format(**globals()))
for mode in build_modes:
modeval = modes[mode]
fmt_lib = 'fmt'
f.write(textwrap.dedent('''\
cxx_ld_flags_{mode} = {cxx_ld_flags}
ld_flags_{mode} = $cxx_ld_flags_{mode}
cxxflags_{mode} = $cxx_ld_flags_{mode} {cxxflags} -iquote. -iquote $builddir/{mode}/gen
ld_flags_{mode} = $cxx_ld_flags_{mode} {lib_ldflags}
cxxflags_{mode} = $cxx_ld_flags_{mode} {lib_cflags} {cxxflags} -iquote. -iquote $builddir/{mode}/gen
libs_{mode} = -l{fmt_lib}
seastar_libs_{mode} = {seastar_libs}
seastar_testing_libs_{mode} = {seastar_testing_libs}
rule cxx.{mode}
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags_{mode} $cxxflags $obj_cxxflags -c -o $out $in
description = CXX $out
@@ -1887,7 +1837,8 @@ with open(buildfile, 'w') as f:
pool = console
description = TEST {mode}
rule rust_lib.{mode}
command = CARGO_HOME=build/{mode}/rust/.cargo cargo build --release --manifest-path=rust/Cargo.toml --target-dir=build/{mode}/rust -p ${{pkg}}
command = CARGO_BUILD_DEP_INFO_BASEDIR='.' cargo build --locked --manifest-path=rust/Cargo.toml --target-dir=$builddir/{mode} --profile=rust-{mode} $
&& touch $out
description = RUST_LIB $out
''').format(mode=mode, antlr3_exec=antlr3_exec, fmt_lib=fmt_lib, test_repeat=test_repeat, test_timeout=test_timeout, **modeval))
f.write(
@@ -1906,7 +1857,6 @@ with open(buildfile, 'w') as f:
ragels = {}
antlr3_grammars = set()
rust_headers = {}
rust_libs = {}
seastar_dep = '$builddir/{}/seastar/libseastar.a'.format(mode)
seastar_testing_dep = '$builddir/{}/seastar/libseastar_testing.a'.format(mode)
for binary in sorted(build_artifacts):
@@ -1917,9 +1867,8 @@ with open(buildfile, 'w') as f:
for src in srcs
if src.endswith('.cc')]
objs.append('$builddir/../utils/arch/powerpc/crc32-vpmsum/crc32.S')
if has_wasmtime:
objs.append('/usr/lib64/libwasmtime.a')
has_thrift = False
has_rust = False
for dep in deps[binary]:
if isinstance(dep, Thrift):
has_thrift = True
@@ -1928,40 +1877,36 @@ with open(buildfile, 'w') as f:
objs += dep.objects('$builddir/' + mode + '/gen')
if isinstance(dep, Json2Code):
objs += dep.objects('$builddir/' + mode + '/gen')
if dep.endswith('/src/lib.rs'):
lib = dep.replace('/src/lib.rs', '.a').replace('rust/','lib')
objs.append('$builddir/' + mode + '/rust/release/' + lib)
if binary.endswith('.a'):
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
if dep.endswith('.rs'):
has_rust = True
idx = dep.rindex('/src/')
obj = dep[:idx].replace('rust/','') + '.o'
objs.append('$builddir/' + mode + '/gen/rust/' + obj)
if has_rust:
objs.append('$builddir/' + mode +'/rust-' + mode + '/librust_combined.a')
local_libs = '$seastar_libs_{} $libs'.format(mode)
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
if binary in tests:
if binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if binary not in tests_not_using_seastar_test_framework:
local_libs += ' ' + "$seastar_testing_libs_{}".format(mode)
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
# quickly re-link the test unstripped by adding a "_g"
# to the test name, e.g., "ninja build/release/testname_g"
link_rule = perf_tests_link_rule if binary.startswith('test/perf/') else tests_link_rule
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: {}.{} {} | {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))
else:
objs.extend(['$builddir/' + mode + '/' + artifact for artifact in [
'libdeflate/libdeflate.a',
] + [
'abseil/' + x for x in abseil_libs
]])
objs.append('$builddir/' + mode + '/gen/utils/gz/crc_combine_table.o')
if binary in tests:
local_libs = '$seastar_libs_{} $libs'.format(mode)
if binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if binary not in tests_not_using_seastar_test_framework:
pc_path = pc[mode].replace('seastar.pc', 'seastar-testing.pc')
local_libs += ' ' + pkg_config(pc_path, '--libs', '--static')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
# quickly re-link the test unstripped by adding a "_g"
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: {}.{} {} | {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))
else:
f.write('build $builddir/{}/{}: {}.{} {} | {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep))
if has_thrift:
f.write(' libs = {} {} $seastar_libs_{} $libs\n'.format(thrift_libs, maybe_static(args.staticboost, '-lboost_system'), mode))
f.write('build $builddir/{}/{}: {}.{} {} | {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write(f'build $builddir/{mode}/{binary}.stripped: strip $builddir/{mode}/{binary}\n')
f.write(f'build $builddir/{mode}/{binary}.debug: phony $builddir/{mode}/{binary}.stripped\n')
for src in srcs:
if src.endswith('.cc'):
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')
@@ -1978,19 +1923,12 @@ with open(buildfile, 'w') as f:
thrifts.add(src)
elif src.endswith('.g'):
antlr3_grammars.add(src)
elif src.endswith('/src/lib.rs'):
hh = '$builddir/' + mode + '/gen/' + src.replace('/src/lib.rs', '.hh')
elif src.endswith('.rs'):
idx = src.rindex('/src/')
hh = '$builddir/' + mode + '/gen/' + src[:idx] + '.hh'
rust_headers[hh] = src
staticlib = src.replace('rust/', '$builddir/' + mode + '/rust/release/lib').replace('/src/lib.rs', '.a')
rust_libs[staticlib] = src
else:
raise Exception('No rule for ' + src)
compiles['$builddir/' + mode + '/gen/utils/gz/crc_combine_table.o'] = '$builddir/' + mode + '/gen/utils/gz/crc_combine_table.cc'
compiles['$builddir/' + mode + '/utils/gz/gen_crc_combine_table.o'] = 'utils/gz/gen_crc_combine_table.cc'
f.write('build {}: run {}\n'.format('$builddir/' + mode + '/gen/utils/gz/crc_combine_table.cc',
'$builddir/' + mode + '/utils/gz/gen_crc_combine_table'))
f.write('build {}: link_build.{} {}\n'.format('$builddir/' + mode + '/utils/gz/gen_crc_combine_table', mode,
'$builddir/' + mode + '/utils/gz/gen_crc_combine_table.o'))
f.write(' libs = $seastar_libs_{}\n'.format(mode))
f.write(
'build {mode}-objects: phony {objs}\n'.format(
@@ -2028,6 +1966,7 @@ with open(buildfile, 'w') as f:
gen_headers += list(serializers.keys())
gen_headers += list(ragels.keys())
gen_headers += list(rust_headers.keys())
gen_headers.append('$builddir/{}/gen/rust/cxx.h'.format(mode))
gen_headers_dep = ' '.join(gen_headers)
for obj in compiles:
@@ -2051,10 +1990,13 @@ with open(buildfile, 'w') as f:
for hh in rust_headers:
src = rust_headers[hh]
f.write('build {}: rust_header {}\n'.format(hh, src))
for lib in rust_libs:
src = rust_libs[lib]
package = src.replace('/src/lib.rs', '').replace('rust/','')
f.write('build {}: rust_lib.{} {}\n pkg = {}\n'.format(lib, mode, src, package))
cc = hh.replace('.hh', '.cc')
f.write('build {}: rust_source {}\n'.format(cc, src))
obj = cc.replace('.cc', '.o')
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, gen_headers_dep))
f.write('build {}: cxxbridge_header\n'.format('$builddir/{}/gen/rust/cxx.h'.format(mode)))
librust = '$builddir/{}/rust-{}/librust_combined'.format(mode, mode)
f.write('build {}.a: rust_lib.{} rust/Cargo.lock\n depfile={}.d\n'.format(librust, mode, librust))
for thrift in thrifts:
outs = ' '.join(thrift.generated('$builddir/{}/gen'.format(mode)))
f.write('build {}: thrift.{} {}\n'.format(outs, mode, thrift.source))
@@ -2070,7 +2012,8 @@ with open(buildfile, 'w') as f:
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
if cc.endswith('Parser.cpp'):
# Unoptimized parsers end up using huge amounts of stack space and overflowing their stack
flags = '-O1'
flags = '-O1' if modes[mode]['optimization-level'] in ['0', 'g', 's'] else ''
if has_sanitize_address_use_after_scope:
flags += ' -fno-sanitize-address-use-after-scope'
f.write(' obj_cxxflags = %s\n' % flags)
@@ -2096,42 +2039,42 @@ with open(buildfile, 'w') as f:
f.write(' target = iotune\n'.format(**locals()))
f.write(textwrap.dedent('''\
build $builddir/{mode}/iotune: copy $builddir/{mode}/seastar/apps/iotune/iotune
build $builddir/{mode}/iotune.stripped: strip $builddir/{mode}/iotune
build $builddir/{mode}/iotune.debug: phony $builddir/{mode}/iotune.stripped
''').format(**locals()))
include_scylla_and_iotune = f'$builddir/{mode}/scylla $builddir/{mode}/iotune' if not args.dist_only else ''
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz: package {include_scylla_and_iotune} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter | always\n'.format(**locals()))
if args.dist_only:
include_scylla_and_iotune = ''
include_scylla_and_iotune_stripped = ''
include_scylla_and_iotune_debug = ''
else:
include_scylla_and_iotune = f'$builddir/{mode}/scylla $builddir/{mode}/iotune'
include_scylla_and_iotune_stripped = f'$builddir/{mode}/scylla.stripped $builddir/{mode}/iotune.stripped'
include_scylla_and_iotune_debug = f'$builddir/{mode}/scylla.debug $builddir/{mode}/iotune.debug'
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz: package {include_scylla_and_iotune} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz: stripped_package {include_scylla_and_iotune_stripped} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.stripped | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz: debuginfo_package {include_scylla_and_iotune_debug} $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter/node_exporter.debug | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write(f'build $builddir/dist/{mode}/redhat: rpmbuild $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/dist/{mode}/redhat: rpmbuild $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/dist/{mode}/debian: debbuild $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/dist/{mode}/debian: debbuild $builddir/{mode}/dist/tar/{scylla_product}-unstripped-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f' mode = {mode}\n')
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian dist-server-compat-{mode} dist-server-compat-arch-{mode}\n')
f.write(f'build dist-server-compat-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz\n')
f.write(f'build dist-server-compat-arch-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz dist-jmx-rpm dist-jmx-deb dist-jmx-compat\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz dist-tools-rpm dist-tools-deb dist-tools-compat\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb dist-python3-compat dist-python3-compat-arch\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz dist-unified-compat-{mode} dist-unified-compat-arch-{mode}\n')
f.write(f'build dist-unified-compat-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz\n')
f.write(f'build dist-unified-compat-arch-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz\n')
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
f.write(f'build dist-server-debuginfo-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz dist-jmx-rpm dist-jmx-deb\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz dist-tools-rpm dist-tools-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write('rule libdeflate.{mode}\n'.format(**locals()))
f.write(' command = make -C libdeflate BUILD_DIR=../$builddir/{mode}/libdeflate/ CFLAGS="{libdeflate_cflags}" CC={args.cc} ../$builddir/{mode}/libdeflate//libdeflate.a\n'.format(**locals()))
f.write('build $builddir/{mode}/libdeflate/libdeflate.a: libdeflate.{mode}\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
for lib in abseil_libs:
f.write('build $builddir/{mode}/abseil/{lib}: ninja $builddir/{mode}/abseil/build.ninja\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write(' subdir = $builddir/{mode}/abseil\n'.format(**locals()))
f.write(' target = {lib}\n'.format(**locals()))
checkheaders_mode = 'dev' if 'dev' in modes else modes.keys()[0]
f.write('build checkheaders: phony || {}\n'.format(' '.join(['$builddir/{}/{}.o'.format(checkheaders_mode, hh) for hh in headers])))
@@ -2148,16 +2091,13 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent(f'''\
build dist-unified-tar: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz' for mode in default_modes])}
build dist-unified-compat: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz' for mode in default_modes])}
build dist-unified-compat-arch: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz' for mode in default_modes])}
build dist-unified: phony dist-unified-tar dist-unified-compat dist-unified-compat-arch
build dist-unified: phony dist-unified-tar
build dist-server-deb: phony {' '.join(['$builddir/dist/{mode}/debian'.format(mode=mode) for mode in build_modes])}
build dist-server-rpm: phony {' '.join(['$builddir/dist/{mode}/redhat'.format(mode=mode) for mode in build_modes])}
build dist-server-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-server-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-server-compat-arch: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-{arch}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-server: phony dist-server-tar dist-server-compat dist-server-compat-arch dist-server-rpm dist-server-deb
build dist-server-debuginfo: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-server: phony dist-server-tar dist-server-debuginfo dist-server-rpm dist-server-deb
rule build-submodule-reloc
command = cd $reloc_dir && ./reloc/build_reloc.sh --version $$(<../../build/SCYLLA-PRODUCT-FILE)-$$(sed 's/-/~/' <../../build/SCYLLA-VERSION-FILE)-$$(<../../build/SCYLLA-RELEASE-FILE) --nodeps $args
@@ -2175,8 +2115,7 @@ with open(buildfile, 'w') as f:
dir = tools/jmx
artifact = $builddir/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-jmx-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-jmx-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-jmx: phony dist-jmx-tar dist-jmx-compat dist-jmx-rpm dist-jmx-deb
build dist-jmx: phony dist-jmx-tar dist-jmx-rpm dist-jmx-deb
build tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
reloc_dir = tools/java
@@ -2187,8 +2126,7 @@ with open(buildfile, 'w') as f:
dir = tools/java
artifact = $builddir/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-tools-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-tools: phony dist-tools-tar dist-tools-compat dist-tools-rpm dist-tools-deb
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
build tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: build-submodule-reloc | build/SCYLLA-PRODUCT-FILE build/SCYLLA-VERSION-FILE build/SCYLLA-RELEASE-FILE
reloc_dir = tools/python3
@@ -2200,14 +2138,10 @@ with open(buildfile, 'w') as f:
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-python3-compat: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-python3-compat-arch: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{arch}-package.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch) for mode in default_modes])}
build dist-python3: phony dist-python3-tar dist-python3-compat dist-python3-compat-arch dist-python3-rpm dist-python3-deb
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb
build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-jmx-rpm dist-tools-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-jmx-tar dist-tools-tar
build dist-compat: phony dist-unified-compat dist-server-compat dist-python3-compat
build dist-compat-arch: phony dist-unified-compat-arch dist-server-compat-arch dist-python3-compat-arch
build dist: phony dist-unified dist-server dist-python3 dist-jmx dist-tools
'''))
@@ -2227,7 +2161,7 @@ with open(buildfile, 'w') as f:
build $builddir/{mode}/dist/tar/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz: copy tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz: copy tools/jmx/build/{scylla_product}-jmx-{scylla_version}-{scylla_release}.noarch.tar.gz
build {mode}-dist: phony dist-server-{mode} dist-python3-{mode} dist-tools-{mode} dist-jmx-{mode} dist-unified-{mode}
build {mode}-dist: phony dist-server-{mode} dist-server-debuginfo-{mode} dist-python3-{mode} dist-tools-{mode} dist-jmx-{mode} dist-unified-{mode}
build dist-{mode}: phony {mode}-dist
build dist-check-{mode}: dist-check
mode = {mode}
@@ -2252,7 +2186,7 @@ with open(buildfile, 'w') as f:
description = List configured modes
build mode_list: mode_list
default {modes_list}
''').format(modes_list=' '.join(default_modes), build_ninja_list=' '.join([f'build/{mode}/{dir}/build.ninja' for mode in build_modes for dir in ['seastar', 'abseil']]), **globals()))
''').format(modes_list=' '.join(default_modes), build_ninja_list=' '.join([f'build/{mode}/{dir}/build.ninja' for mode in build_modes for dir in ['seastar']]), **globals()))
unit_test_list = set(test for test in build_artifacts if test in set(tests))
f.write(textwrap.dedent('''\
rule unit_test_list
@@ -2270,7 +2204,9 @@ with open(buildfile, 'w') as f:
build $builddir/debian/debian: debian_files_gen | always
rule extract_node_exporter
command = tar -C build -xvpf {node_exporter_filename} --no-same-owner && rm -rfv build/node_exporter && mv -v build/{node_exporter_dirname} build/node_exporter
build $builddir/node_exporter: extract_node_exporter | always
build $builddir/node_exporter/node_exporter: extract_node_exporter | always
build $builddir/node_exporter/node_exporter.stripped: strip $builddir/node_exporter/node_exporter
build $builddir/node_exporter/node_exporter.debug: phony $builddir/node_exporter/node_exporter.stripped
rule print_help
command = ./scripts/build-help.sh
build help: print_help | always
@@ -2279,7 +2215,7 @@ with open(buildfile, 'w') as f:
compdb = 'compile_commands.json'
# per-mode compdbs are built by taking the relevant entries from the
# output of "ninja -t compdb" and combining them with the CMake-made
# compdbs for Seastar and Abseil in the relevant mode.
# compdbs for Seastar in the relevant mode.
#
# "ninja -t compdb" output has to be filtered because
# - it contains rules for all selected modes, and several entries for
@@ -2294,7 +2230,7 @@ with tempfile.NamedTemporaryFile() as ninja_compdb:
# build mode-specific compdbs
for mode in selected_modes:
mode_out = outdir + '/' + mode
submodule_compdbs = [mode_out + '/' + submodule + '/' + compdb for submodule in ['abseil', 'seastar']]
submodule_compdbs = [mode_out + '/' + submodule + '/' + compdb for submodule in ['seastar']]
with open(mode_out + '/' + compdb, 'w+b') as combined_mode_specific_compdb:
subprocess.run(['./scripts/merge-compdb.py', 'build/' + mode,
ninja_compdb.name] + submodule_compdbs, stdout=combined_mode_specific_compdb)

View File

@@ -12,10 +12,6 @@
#include <boost/range/algorithm/sort.hpp>
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
std::ostream& operator<<(std::ostream& os, counter_shard_view csv) {
return os << "{global_shard id: " << csv.id() << " value: " << csv.value()
<< " clock: " << csv.logical_clock() << "}";
@@ -174,9 +170,11 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id) {
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, locator::host_id local_host_id) {
// FIXME: allow current_state to be frozen_mutation
utils::UUID local_id = local_host_id.uuid();
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);

View File

@@ -14,51 +14,12 @@
#include "atomic_cell.hh"
#include "types.hh"
#include "locator/host_id.hh"
class mutation;
class atomic_cell_or_collection;
class counter_id {
int64_t _least_significant;
int64_t _most_significant;
public:
static_assert(std::is_same<decltype(std::declval<utils::UUID>().get_least_significant_bits()), int64_t>::value
&& std::is_same<decltype(std::declval<utils::UUID>().get_most_significant_bits()), int64_t>::value,
"utils::UUID is expected to work with two signed 64-bit integers");
counter_id() = default;
explicit counter_id(utils::UUID uuid) noexcept
: _least_significant(uuid.get_least_significant_bits())
, _most_significant(uuid.get_most_significant_bits())
{ }
utils::UUID to_uuid() const {
return utils::UUID(_most_significant, _least_significant);
}
bool operator<(const counter_id& other) const {
return to_uuid() < other.to_uuid();
}
bool operator>(const counter_id& other) const {
return other.to_uuid() < to_uuid();
}
bool operator==(const counter_id& other) const {
return to_uuid() == other.to_uuid();
}
bool operator!=(const counter_id& other) const {
return !(*this == other);
}
public:
// For tests.
static counter_id generate_random() {
return counter_id(utils::make_random_uuid());
}
};
static_assert(
std::is_standard_layout_v<counter_id> && std::is_trivial_v<counter_id>,
"counter_id should be a POD type");
std::ostream& operator<<(std::ostream& os, const counter_id& id);
using counter_id = utils::tagged_uuid<struct counter_id_tag>;
template<mutable_view is_mutable>
class basic_counter_shard_view {
@@ -406,13 +367,13 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id);
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, locator::host_id local_id);
template<>
struct appending_hash<counter_shard_view> {
template<typename Hasher>
void operator()(Hasher& h, const counter_shard_view& cshard) const {
::feed_hash(h, cshard.id().to_uuid());
::feed_hash(h, cshard.id());
::feed_hash(h, cshard.value());
::feed_hash(h, cshard.logical_clock());
}

View File

@@ -44,13 +44,14 @@ options {
#include "cql3/statements/drop_aggregate_statement.hh"
#include "cql3/statements/drop_service_level_statement.hh"
#include "cql3/statements/detach_service_level_statement.hh"
#include "cql3/statements/truncate_statement.hh"
#include "cql3/statements/raw/truncate_statement.hh"
#include "cql3/statements/raw/update_statement.hh"
#include "cql3/statements/raw/insert_statement.hh"
#include "cql3/statements/raw/delete_statement.hh"
#include "cql3/statements/index_prop_defs.hh"
#include "cql3/statements/raw/use_statement.hh"
#include "cql3/statements/raw/batch_statement.hh"
#include "cql3/statements/raw/describe_statement.hh"
#include "cql3/statements/list_users_statement.hh"
#include "cql3/statements/grant_statement.hh"
#include "cql3/statements/revoke_statement.hh"
@@ -358,6 +359,7 @@ cqlStatement returns [std::unique_ptr<raw::parsed_statement> stmt]
| st46=listServiceLevelStatement { $stmt = std::move(st46); }
| st47=listServiceLevelAttachStatement { $stmt = std::move(st47); }
| st48=pruneMaterializedViewStatement { $stmt = std::move(st48); }
| st49=describeStatement { $stmt = std::move(st49); }
;
/*
@@ -371,7 +373,8 @@ useStatement returns [std::unique_ptr<raw::use_statement> stmt]
* SELECT [JSON] <expression>
* FROM <CF>
* WHERE KEY = "key1" AND COL > 1 AND COL < 100
* LIMIT <NUMBER>;
* LIMIT <NUMBER>
* [USING TIMEOUT <duration>];
*/
selectStatement returns [std::unique_ptr<raw::select_statement> expr]
@init {
@@ -398,7 +401,7 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
( K_LIMIT rows=intValue { limit = rows; } )?
( K_ALLOW K_FILTERING { allow_filtering = true; } )?
( K_BYPASS K_CACHE { bypass_cache = true; })?
( usingClause[attrs] )?
( usingTimeoutClause[attrs] )?
{
auto params = make_lw_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, statement_subtype, bypass_cache);
$expr = std::make_unique<raw::select_statement>(std::move(cf), std::move(params),
@@ -460,8 +463,7 @@ orderByClause[raw::select_statement::parameters::orderings_type& orderings]
;
jsonValue returns [expression value]
:
| s=STRING_LITERAL { $value = untyped_constant{untyped_constant::string, $s.text}; }
: s=STRING_LITERAL { $value = untyped_constant{untyped_constant::string, $s.text}; }
| m=marker { $value = std::move(m); }
;
@@ -518,6 +520,19 @@ usingClauseObjective[std::unique_ptr<cql3::attributes::raw>& attrs]
| K_TIMEOUT to=term { attrs->timeout = to; }
;
usingTimestampTimeoutClause[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_USING usingTimestampTimeoutClauseObjective[attrs] ( K_AND usingTimestampTimeoutClauseObjective[attrs] )*
;
usingTimestampTimeoutClauseObjective[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_TIMESTAMP ts=intValue { attrs->timestamp = ts; }
| K_TIMEOUT to=term { attrs->timeout = to; }
;
usingTimeoutClause[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_USING K_TIMEOUT to=term { attrs->timeout = to; }
;
/**
* UPDATE <CF>
* USING TIMESTAMP <long>
@@ -552,7 +567,7 @@ updateConditions returns [conditions_type conditions]
/**
* DELETE name1, name2
* FROM <CF>
* USING TIMESTAMP <long>
* [USING (TIMESTAMP <long> | TIMEOUT <duration>) [AND ...]]
* WHERE KEY = keyname
[IF (EXISTS | name = value, ...)];
*/
@@ -564,7 +579,7 @@ deleteStatement returns [std::unique_ptr<raw::delete_statement> expr]
}
: K_DELETE ( dels=deleteSelection { column_deletions = std::move(dels); } )?
K_FROM cf=columnFamilyName
( usingClause[attrs] )?
( usingTimestampTimeoutClause[attrs] )?
K_WHERE wclause=whereClause
( K_IF ( K_EXISTS { if_exists = true; } | conditions=updateConditions ))?
{
@@ -857,7 +872,8 @@ indexIdent returns [::shared_ptr<index_target::raw> id]
@init {
std::vector<::shared_ptr<cql3::column_identifier::raw>> columns;
}
: c=cident { $id = index_target::raw::values_of(c); }
: c=cident { $id = index_target::raw::regular_values_of(c); }
| K_VALUES '(' c=cident ')' { $id = index_target::raw::collection_values_of(c); }
| K_KEYS '(' c=cident ')' { $id = index_target::raw::keys_of(c); }
| K_ENTRIES '(' c=cident ')' { $id = index_target::raw::keys_and_values_of(c); }
| K_FULL '(' c=cident ')' { $id = index_target::raw::full_collection(c); }
@@ -1057,10 +1073,18 @@ dropIndexStatement returns [std::unique_ptr<drop_index_statement> expr]
;
/**
* TRUNCATE <CF>;
* TRUNCATE [TABLE] <CF>
* [USING TIMEOUT <duration>];
*/
truncateStatement returns [std::unique_ptr<truncate_statement> stmt]
: K_TRUNCATE (K_COLUMNFAMILY)? cf=columnFamilyName { $stmt = std::make_unique<truncate_statement>(cf); }
truncateStatement returns [std::unique_ptr<raw::truncate_statement> stmt]
@init {
auto attrs = std::make_unique<cql3::attributes::raw>();
}
: K_TRUNCATE (K_COLUMNFAMILY)? cf=columnFamilyName
( usingTimeoutClause[attrs] )?
{
$stmt = std::make_unique<raw::truncate_statement>(std::move(cf), std::move(attrs));
}
;
/**
@@ -1345,6 +1369,59 @@ listServiceLevelAttachStatement returns [std::unique_ptr<list_service_level_atta
{ $stmt = std::make_unique<list_service_level_attachments_statement>(); }
;
/**
* (DESCRIBE | DESC) (
* CLUSTER
* [FULL] SCHEMA
* KEYSPACES
* [ONLY] KEYSPACE <name>?
* TABLES
* TABLE <name>
* TYPES
* TYPE <name>
* FUNCTIONS
* FUNCTION <name>
* AGGREGATES
* AGGREGATE <name>
* ) (WITH INTERNALS)?
*/
describeStatement returns [std::unique_ptr<cql3::statements::raw::describe_statement> stmt]
@init {
bool fullSchema = false;
bool pending = false;
bool config = false;
bool only = false;
std::optional<sstring> keyspace;
sstring generic_name = "";
}
: ( K_DESCRIBE | K_DESC )
( (K_CLUSTER) => K_CLUSTER { $stmt = cql3::statements::raw::describe_statement::cluster(); }
| (K_FULL { fullSchema=true; })? K_SCHEMA { $stmt = cql3::statements::raw::describe_statement::schema(fullSchema); }
| (K_KEYSPACES) => K_KEYSPACES { $stmt = cql3::statements::raw::describe_statement::keyspaces(); }
| (K_ONLY { only=true; })? K_KEYSPACE ( ks=keyspaceName { keyspace = ks; })?
{ $stmt = cql3::statements::raw::describe_statement::keyspace(keyspace, only); }
| (K_TABLES) => K_TABLES { $stmt = cql3::statements::raw::describe_statement::tables(); }
| K_COLUMNFAMILY cf=columnFamilyName { $stmt = cql3::statements::raw::describe_statement::table(cf); }
| K_INDEX idx=columnFamilyName { $stmt = cql3::statements::raw::describe_statement::index(idx); }
| K_MATERIALIZED K_VIEW view=columnFamilyName { $stmt = cql3::statements::raw::describe_statement::view(view); }
| (K_TYPES) => K_TYPES { $stmt = cql3::statements::raw::describe_statement::types(); }
| K_TYPE tn=userTypeName { $stmt = cql3::statements::raw::describe_statement::type(tn); }
| (K_FUNCTIONS) => K_FUNCTIONS { $stmt = cql3::statements::raw::describe_statement::functions(); }
| K_FUNCTION fn=functionName { $stmt = cql3::statements::raw::describe_statement::function(fn); }
| (K_AGGREGATES) => K_AGGREGATES { $stmt = cql3::statements::raw::describe_statement::aggregates(); }
| K_AGGREGATE ag=functionName { $stmt = cql3::statements::raw::describe_statement::aggregate(ag); }
| ( ( ksT=IDENT { keyspace = sstring{$ksT.text}; }
| ksT=QUOTED_NAME { keyspace = sstring{$ksT.text}; }
| ksK=unreserved_keyword { keyspace = ksK; } )
'.' )?
( tT=IDENT { generic_name = sstring{$tT.text}; }
| tT=QUOTED_NAME { generic_name = sstring{$tT.text}; }
| tK=unreserved_keyword { generic_name = tK; } )
{ $stmt = cql3::statements::raw::describe_statement::generic(keyspace, generic_name); }
)
( K_WITH K_INTERNALS { $stmt->with_internals_details(); } )?
;
/** DEFINITIONS **/
// Column Identifiers. These need to be treated differently from other
@@ -1396,7 +1473,7 @@ serviceLevelOrRoleName returns [sstring name]
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower); }
| t=STRING_LITERAL { $name = sstring($t.text); }
| t=QUOTED_NAME { $name = sstring($t.text); }
| k=unreserved_keyword { $name = sstring($t.text);
| k=unreserved_keyword { $name = k;
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower);}
| QMARK {add_recognition_error("Bind variables cannot be used for service levels or role names");}
;
@@ -1490,7 +1567,7 @@ value returns [expression value]
| l=collectionLiteral { $value = std::move(l); }
| u=usertypeLiteral { $value = std::move(u); }
| t=tupleLiteral { $value = std::move(t); }
| K_NULL { $value = null(); }
| K_NULL { $value = make_untyped_null(); }
| e=marker { $value = std::move(e); }
;
@@ -1500,8 +1577,7 @@ marker returns [expression value]
;
intValue returns [expression value]
:
| t=INTEGER { $value = untyped_constant{untyped_constant::integer, $t.text}; }
: t=INTEGER { $value = untyped_constant{untyped_constant::integer, $t.text}; }
| e=marker { $value = std::move(e); }
;
@@ -1655,7 +1731,7 @@ relation returns [expression e]
| K_TOKEN l=tupleOfIdentifiers type=relationType t=term
{ $e = binary_operator(token{std::move(l.elements)}, type, std::move(t)); }
| name=cident K_IS K_NOT K_NULL {
$e = binary_operator(unresolved_identifier{std::move(name)}, oper_t::IS_NOT, null()); }
$e = binary_operator(unresolved_identifier{std::move(name)}, oper_t::IS_NOT, make_untyped_null()); }
| name=cident K_IN marker1=marker
{ $e = binary_operator(unresolved_identifier{std::move(name)}, oper_t::IN, std::move(marker1)); }
| name=cident K_IN in_values=singleColumnInValues
@@ -1874,10 +1950,13 @@ unreserved_function_keyword returns [sstring str]
basic_unreserved_keyword returns [sstring str]
: k=( K_KEYS
| K_AS
| K_CLUSTER
| K_CLUSTERING
| K_COMPACT
| K_STORAGE
| K_TABLES
| K_TYPE
| K_TYPES
| K_VALUES
| K_MAP
| K_LIST
@@ -1901,11 +1980,14 @@ basic_unreserved_keyword returns [sstring str]
| K_TRIGGER
| K_DISTINCT
| K_CONTAINS
| K_INTERNALS
| K_STATIC
| K_FROZEN
| K_TUPLE
| K_FUNCTION
| K_FUNCTIONS
| K_AGGREGATE
| K_AGGREGATES
| K_SFUNC
| K_STYPE
| K_REDUCEFUNC
@@ -1933,6 +2015,9 @@ basic_unreserved_keyword returns [sstring str]
| K_LEVEL
| K_LEVELS
| K_PRUNE
| K_ONLY
| K_DESCRIBE
| K_DESC
) { $str = $k.text; }
;
@@ -1990,11 +2075,14 @@ K_TRUNCATE: T R U N C A T E;
K_DELETE: D E L E T E;
K_IN: I N;
K_CREATE: C R E A T E;
K_SCHEMA: S C H E M A;
K_KEYSPACE: ( K E Y S P A C E
| S C H E M A );
| K_SCHEMA );
K_KEYSPACES: K E Y S P A C E S;
K_COLUMNFAMILY:( C O L U M N F A M I L Y
| T A B L E );
K_TABLES: ( C O L U M N F A M I L I E S
| T A B L E S );
K_MATERIALIZED:M A T E R I A L I Z E D;
K_VIEW: V I E W;
K_INDEX: I N D E X;
@@ -2011,6 +2099,7 @@ K_ALTER: A L T E R;
K_RENAME: R E N A M E;
K_ADD: A D D;
K_TYPE: T Y P E;
K_TYPES: T Y P E S;
K_COMPACT: C O M P A C T;
K_STORAGE: S T O R A G E;
K_ORDER: O R D E R;
@@ -2022,6 +2111,8 @@ K_FILTERING: F I L T E R I N G;
K_IF: I F;
K_IS: I S;
K_CONTAINS: C O N T A I N S;
K_INTERNALS: I N T E R N A L S;
K_ONLY: O N L Y;
K_GRANT: G R A N T;
K_ALL: A L L;
@@ -2045,6 +2136,7 @@ K_LOGIN: L O G I N;
K_NOLOGIN: N O L O G I N;
K_OPTIONS: O P T I O N S;
K_CLUSTER: C L U S T E R;
K_CLUSTERING: C L U S T E R I N G;
K_ASCII: A S C I I;
K_BIGINT: B I G I N T;
@@ -2084,7 +2176,9 @@ K_STATIC: S T A T I C;
K_FROZEN: F R O Z E N;
K_FUNCTION: F U N C T I O N;
K_FUNCTIONS: F U N C T I O N S;
K_AGGREGATE: A G G R E G A T E;
K_AGGREGATES: A G G R E G A T E S;
K_SFUNC: S F U N C;
K_STYPE: S T Y P E;
K_REDUCEFUNC: R E D U C E F U N C;

View File

@@ -10,6 +10,7 @@
#include "cql3/attributes.hh"
#include "cql3/column_identifier.hh"
#include <optional>
namespace cql3 {
@@ -20,7 +21,9 @@ std::unique_ptr<attributes> attributes::none() {
attributes::attributes(std::optional<cql3::expr::expression>&& timestamp,
std::optional<cql3::expr::expression>&& time_to_live,
std::optional<cql3::expr::expression>&& timeout)
: _timestamp{std::move(timestamp)}
: _timestamp_unset_guard(timestamp)
, _timestamp{std::move(timestamp)}
, _time_to_live_unset_guard(time_to_live)
, _time_to_live{std::move(time_to_live)}
, _timeout{std::move(timeout)}
{ }
@@ -38,7 +41,7 @@ bool attributes::is_timeout_set() const {
}
int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (!_timestamp.has_value()) {
if (!_timestamp.has_value() || _timestamp_unset_guard.is_unset(options)) {
return now;
}
@@ -46,31 +49,25 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of timestamp");
}
if (tval.is_unset_value()) {
return now;
}
try {
return tval.view().validate_and_deserialize<int64_t>(*long_type, cql_serialization_format::internal());
return tval.view().validate_and_deserialize<int64_t>(*long_type);
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
}
int32_t attributes::get_time_to_live(const query_options& options) {
if (!_time_to_live.has_value())
return 0;
std::optional<int32_t> attributes::get_time_to_live(const query_options& options) {
if (!_time_to_live.has_value() || _time_to_live_unset_guard.is_unset(options))
return std::nullopt;
cql3::raw_value tval = expr::evaluate(*_time_to_live, options);
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of TTL");
}
if (tval.is_unset_value()) {
return 0;
}
int32_t ttl;
try {
ttl = tval.view().validate_and_deserialize<int32_t>(*int32_type, cql_serialization_format::internal());
ttl = tval.view().validate_and_deserialize<int32_t>(*int32_type);
}
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
@@ -91,8 +88,8 @@ int32_t attributes::get_time_to_live(const query_options& options) {
db::timeout_clock::duration attributes::get_timeout(const query_options& options) const {
cql3::raw_value timeout = expr::evaluate(*_timeout, options);
if (timeout.is_null() || timeout.is_unset_value()) {
throw exceptions::invalid_request_exception("Timeout value cannot be unset/null");
if (timeout.is_null()) {
throw exceptions::invalid_request_exception("Timeout value cannot be null");
}
cql_duration duration = timeout.view().deserialize<cql_duration>(*duration_type);
if (duration.months || duration.days) {

View File

@@ -11,6 +11,7 @@
#pragma once
#include "cql3/expr/expression.hh"
#include "cql3/expr/unset.hh"
#include "db/timeout_clock.hh"
namespace cql3 {
@@ -24,7 +25,9 @@ class prepare_context;
*/
class attributes final {
private:
expr::unset_bind_variable_guard _timestamp_unset_guard;
std::optional<cql3::expr::expression> _timestamp;
expr::unset_bind_variable_guard _time_to_live_unset_guard;
std::optional<cql3::expr::expression> _time_to_live;
std::optional<cql3::expr::expression> _timeout;
public:
@@ -42,7 +45,7 @@ public:
int64_t get_timestamp(int64_t now, const query_options& options);
int32_t get_time_to_live(const query_options& options);
std::optional<int32_t> get_time_to_live(const query_options& options);
db::timeout_clock::duration get_timeout(const query_options& options) const;

View File

@@ -139,10 +139,6 @@ bool column_condition::applies_to(const data_value* cell_value, const query_opti
cql3::raw_value key_constant = expr::evaluate(*_collection_element, options);
cql3::raw_value_view key = key_constant.view();
if (key.is_unset_value()) {
throw exceptions::invalid_request_exception(
format("Invalid 'unset' value in {} element access", cell_type.cql3_type_name()));
}
if (key.is_null()) {
throw exceptions::invalid_request_exception(
format("Invalid null value for {} element access", cell_type.cql3_type_name()));
@@ -196,9 +192,6 @@ bool column_condition::applies_to(const data_value* cell_value, const query_opti
// <, >, >=, <=, !=
cql3::raw_value param = expr::evaluate(*_value, options);
if (param.is_unset_value()) {
throw exceptions::invalid_request_exception("Invalid 'unset' value in condition");
}
if (param.is_null()) {
if (_op == expr::oper_t::EQ) {
return cell_value == nullptr;
@@ -224,9 +217,6 @@ bool column_condition::applies_to(const data_value* cell_value, const query_opti
return (*_matcher)(bytes_view(cell_value->serialize_nonnull()));
} else {
auto param = expr::evaluate(*_value, options); // LIKE pattern
if (param.is_unset_value()) {
throw exceptions::invalid_request_exception("Invalid 'unset' value in LIKE pattern");
}
if (param.is_null()) {
throw exceptions::invalid_request_exception("Invalid NULL value in LIKE pattern");
}
@@ -309,7 +299,7 @@ column_condition::raw::prepare(data_dictionary::database db, const sstring& keys
if (_op == expr::oper_t::LIKE) {
auto literal_term = expr::as_if<expr::untyped_constant>(&*_value);
if (literal_term) {
if (literal_term && literal_term->partial_type != expr::untyped_constant::type_class::null) {
// Pass matcher object
const sstring& pattern = literal_term->raw_text;
return column_condition::condition(receiver, std::move(collection_element_expression),

View File

@@ -73,6 +73,14 @@ public:
std::move(in_values), nullptr, expr::oper_t::IN);
}
const std::optional<expr::expression>& get_value() const {
return _value;
}
expr::oper_t get_operation() const {
return _op;
}
class raw final {
private:
std::optional<cql3::expr::expression> _value;

View File

@@ -10,6 +10,7 @@
#include "cql3/constants.hh"
#include "cql3/cql3_type.hh"
#include "cql3/statements/strongly_consistent_modification_statement.hh"
namespace cql3 {
void constants::deleter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
@@ -22,4 +23,9 @@ void constants::deleter::execute(mutation& m, const clustering_key_prefix& prefi
m.set_cell(prefix, column, params.make_dead_cell());
}
}
void
constants::setter::prepare_for_broadcast_tables(statements::broadcast_tables::prepared_update& query) const {
query.new_value = *_e;
}
}

View File

@@ -11,12 +11,17 @@
#pragma once
#include "cql3/abstract_marker.hh"
#include "cql3/query_options.hh"
#include "cql3/update_parameters.hh"
#include "cql3/operation.hh"
#include "cql3/values.hh"
#include "mutation.hh"
#include <seastar/core/shared_ptr.hh>
namespace service::broadcast_tables {
class update_query;
}
namespace cql3 {
/**
@@ -28,9 +33,9 @@ public:
private static final Logger logger = LoggerFactory.getLogger(Constants.class);
#endif
public:
class setter : public operation {
class setter : public operation_skip_if_unset {
public:
using operation::operation;
using operation_skip_if_unset::operation_skip_if_unset;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = expr::evaluate(*_e, params._options);
@@ -44,32 +49,30 @@ public:
m.set_cell(prefix, column, params.make_cell(*column.type, value));
}
}
virtual void prepare_for_broadcast_tables(statements::broadcast_tables::prepared_update& query) const override;
};
struct adder final : operation {
using operation::operation;
struct adder final : operation_skip_if_unset {
using operation_skip_if_unset::operation_skip_if_unset;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = expr::evaluate(*_e, params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value.view().deserialize<int64_t>(*long_type);
m.set_cell(prefix, column, params.make_counter_update_cell(increment));
}
};
struct subtracter final : operation {
using operation::operation;
struct subtracter final : operation_skip_if_unset {
using operation_skip_if_unset::operation_skip_if_unset;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = expr::evaluate(*_e, params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value.view().deserialize<int64_t>(*long_type);
if (increment == std::numeric_limits<int64_t>::min()) {
@@ -79,10 +82,10 @@ public:
}
};
class deleter : public operation {
class deleter : public operation_no_unset_support {
public:
deleter(const column_definition& column)
: operation(column, std::nullopt)
: operation_no_unset_support(column, std::nullopt)
{ }
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;

View File

@@ -473,27 +473,40 @@ sstring maybe_quote(const sstring& identifier) {
return result;
}
sstring quote(const sstring& identifier) {
template <char C>
static sstring quote_with(const sstring& str) {
static const std::string quote_str{C};
// quote empty string
if (identifier.empty()) {
return "\"\"";
if (str.empty()) {
return make_sstring(quote_str, quote_str);
}
size_t num_quotes = 0;
for (char c : identifier) {
num_quotes += (c == '"');
for (char c : str) {
num_quotes += (c == C);
}
if (num_quotes == 0) {
return make_sstring("\"", identifier, "\"");
return make_sstring(quote_str, str, quote_str);
}
static const std::regex double_quote_re("\"");
static const std::string double_quote_str{C, C};
static const std::regex quote_re(std::string{C});
std::string result;
result.reserve(2 + identifier.size() + num_quotes);
result.push_back('"');
std::regex_replace(std::back_inserter(result), identifier.begin(), identifier.end(), double_quote_re, "\"\"");
result.push_back('"');
result.reserve(2 + str.size() + num_quotes);
result.push_back(C);
std::regex_replace(std::back_inserter(result), str.begin(), str.end(), quote_re, double_quote_str);
result.push_back(C);
return result;
}
sstring quote(const sstring& identifier) {
return quote_with<'"'>(identifier);
}
sstring single_quote(const sstring& str) {
return quote_with<'\''>(str);
}
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -76,7 +76,6 @@ struct column_mutation_attribute;
struct function_call;
struct cast;
struct field_selection;
struct null;
struct bind_variable;
struct untyped_constant;
struct constant;
@@ -96,7 +95,6 @@ concept ExpressionElement
|| std::same_as<T, function_call>
|| std::same_as<T, cast>
|| std::same_as<T, field_selection>
|| std::same_as<T, null>
|| std::same_as<T, bind_variable>
|| std::same_as<T, untyped_constant>
|| std::same_as<T, constant>
@@ -117,7 +115,6 @@ concept invocable_on_expression
&& std::invocable<Func, function_call>
&& std::invocable<Func, cast>
&& std::invocable<Func, field_selection>
&& std::invocable<Func, null>
&& std::invocable<Func, bind_variable>
&& std::invocable<Func, untyped_constant>
&& std::invocable<Func, constant>
@@ -138,7 +135,6 @@ concept invocable_on_expression_ref
&& std::invocable<Func, function_call&>
&& std::invocable<Func, cast&>
&& std::invocable<Func, field_selection&>
&& std::invocable<Func, null&>
&& std::invocable<Func, bind_variable&>
&& std::invocable<Func, untyped_constant&>
&& std::invocable<Func, constant&>
@@ -147,7 +143,7 @@ concept invocable_on_expression_ref
&& std::invocable<Func, usertype_constructor&>
;
/// A CQL expression -- union of all possible expression types. bool means a Boolean constant.
/// A CQL expression -- union of all possible expression types.
class expression final {
// 'impl' holds a variant of all expression types, but since
// variants of incomplete types are not allowed, we forward declare it
@@ -198,9 +194,7 @@ bool operator==(const expression& e1, const expression& e2);
// An expression that doesn't contain subexpressions
template <typename E>
concept LeafExpression
= std::same_as<bool, E>
|| std::same_as<unresolved_identifier, E>
|| std::same_as<null, E>
= std::same_as<unresolved_identifier, E>
|| std::same_as<bind_variable, E>
|| std::same_as<untyped_constant, E>
|| std::same_as<constant, E>
@@ -346,12 +340,6 @@ struct field_selection {
friend bool operator==(const field_selection&, const field_selection&) = default;
};
struct null {
data_type type; // may be null before prepare
friend bool operator==(const null&, const null&) = default;
};
struct bind_variable {
int32_t bind_index;
@@ -365,17 +353,18 @@ struct bind_variable {
// A constant which does not yet have a date type. It is partially typed
// (we know if it's floating or int) but not sized.
struct untyped_constant {
enum type_class { integer, floating_point, string, boolean, duration, uuid, hex };
enum type_class { integer, floating_point, string, boolean, duration, uuid, hex, null };
type_class partial_type;
sstring raw_text;
friend bool operator==(const untyped_constant&, const untyped_constant&) = default;
};
untyped_constant make_untyped_null();
// Represents a constant value with known value and type
// For null and unset the type can sometimes be set to empty_type
struct constant {
// A value serialized using the internal (latest) cql_serialization_format
cql3::raw_value value;
// Never nullptr, for NULL and UNSET might be empty_type
@@ -383,7 +372,6 @@ struct constant {
constant(cql3::raw_value value, data_type type);
static constant make_null(data_type val_type = empty_type);
static constant make_unset_value(data_type val_type = empty_type);
static constant make_bool(bool bool_val);
bool is_null() const;
@@ -436,7 +424,7 @@ struct usertype_constructor {
struct expression::impl final {
using variant_type = std::variant<
conjunction, binary_operator, column_value, token, unresolved_identifier,
column_mutation_attribute, function_call, cast, field_selection, null,
column_mutation_attribute, function_call, cast, field_selection,
bind_variable, untyped_constant, constant, tuple_constructor, collection_constructor,
usertype_constructor, subscript>;
variant_type v;
@@ -666,6 +654,9 @@ inline auto find_clustering_order(const expression& e) {
/// empty conjunction).
std::vector<expression> boolean_factors(expression e);
/// Run the given function for each element in the top level conjunction.
void for_each_boolean_factor(const expression& e, const noncopyable_function<void (const expression&)>& for_each_func);
/// True iff binary_operator involves a collection.
extern bool is_on_collection(const binary_operator&);

View File

@@ -78,7 +78,7 @@ static
void
usertype_constructor_validate_assignable_to(const usertype_constructor& u, data_dictionary::database db, const sstring& keyspace, const column_specification& receiver) {
if (!receiver.type->is_user_type()) {
throw exceptions::invalid_request_exception(format("Invalid user type literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid user type literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
auto ut = static_pointer_cast<const user_type_impl>(receiver.type);
@@ -90,7 +90,7 @@ usertype_constructor_validate_assignable_to(const usertype_constructor& u, data_
const expression& value = u.elements.at(field);
auto&& field_spec = usertype_field_spec_of(receiver, i);
if (!assignment_testable::is_assignable(test_assignment(value, db, keyspace, *field_spec))) {
throw exceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}", receiver.name, field, field_spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}", *receiver.name, field, field_spec->type->as_cql3_type()));
}
}
}
@@ -123,7 +123,7 @@ usertype_constructor_prepare_expression(const usertype_constructor& u, data_dict
auto iraw = u.elements.find(field);
expression raw;
if (iraw == u.elements.end()) {
raw = expr::null();
raw = expr::make_untyped_null();
} else {
raw = iraw->second;
++found_values;
@@ -246,6 +246,21 @@ map_prepare_expression(const collection_constructor& c, data_dictionary::databas
auto key_spec = maps::key_spec_of(*receiver);
auto value_spec = maps::value_spec_of(*receiver);
const map_type_impl* map_type = dynamic_cast<const map_type_impl*>(&receiver->type->without_reversed());
if (map_type == nullptr) {
on_internal_error(expr_logger,
format("map_prepare_expression bad non-map receiver type: {}", receiver->type->name()));
}
data_type map_element_tuple_type = tuple_type_impl::get_instance({map_type->get_keys_type(), map_type->get_values_type()});
// In Cassandra, an empty (unfrozen) map/set/list is equivalent to the column being null. In
// other words a non-frozen collection only exists if it has elements. Return nullptr right
// away to simplify predicate evaluation. See also
// https://issues.apache.org/jira/browse/CASSANDRA-5141
if (map_type->is_multi_cell() && c.elements.empty()) {
return constant::make_null(receiver->type);
}
std::vector<expression> values;
values.reserve(c.elements.size());
bool all_terminal = true;
@@ -264,7 +279,7 @@ map_prepare_expression(const collection_constructor& c, data_dictionary::databas
values.emplace_back(tuple_constructor {
.elements = {std::move(k), std::move(v)},
.type = entry_tuple.type
.type = map_element_tuple_type
});
}
@@ -298,7 +313,7 @@ set_validate_assignable_to(const collection_constructor& c, data_dictionary::dat
return;
}
throw exceptions::invalid_request_exception(format("Invalid set literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid set literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
auto&& value_spec = set_value_spec_of(receiver);
@@ -486,18 +501,18 @@ void
tuple_constructor_validate_assignable_to(const tuple_constructor& tc, data_dictionary::database db, const sstring& keyspace, const column_specification& receiver) {
auto tt = dynamic_pointer_cast<const tuple_type_impl>(receiver.type->underlying_type());
if (!tt) {
throw exceptions::invalid_request_exception(format("Invalid tuple type literal for {} of type {}", receiver.name, receiver.type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid tuple type literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
for (size_t i = 0; i < tc.elements.size(); ++i) {
if (i >= tt->size()) {
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: too many elements. Type {} expects {:d} but got {:d}",
receiver.name, tt->as_cql3_type(), tt->size(), tc.elements.size()));
*receiver.name, tt->as_cql3_type(), tt->size(), tc.elements.size()));
}
auto&& value = tc.elements[i];
auto&& spec = component_spec_of(receiver, i);
if (!assignment_testable::is_assignable(test_assignment(value, db, keyspace, *spec))) {
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}", receiver.name, i, spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}", *receiver.name, i, spec->type->as_cql3_type()));
}
}
}
@@ -567,6 +582,7 @@ operator<<(std::ostream&out, untyped_constant::type_class t)
case untyped_constant::type_class::boolean: return out << "BOOLEAN";
case untyped_constant::type_class::hex: return out << "HEX";
case untyped_constant::type_class::duration: return out << "DURATION";
case untyped_constant::type_class::null: return out << "NULL";
}
abort();
}
@@ -594,8 +610,9 @@ static
assignment_testable::test_result
untyped_constant_test_assignment(const untyped_constant& uc, data_dictionary::database db, const sstring& keyspace, const column_specification& receiver)
{
bool uc_is_null = uc.partial_type == untyped_constant::type_class::null;
auto receiver_type = receiver.type->as_cql3_type();
if (receiver_type.is_collection() || receiver_type.is_user_type()) {
if ((receiver_type.is_collection() || receiver_type.is_user_type()) && !uc_is_null) {
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
if (!receiver_type.is_native()) {
@@ -660,6 +677,10 @@ untyped_constant_test_assignment(const untyped_constant& uc, data_dictionary::da
return assignment_testable::test_result::EXACT_MATCH;
}
break;
case untyped_constant::type_class::null:
return receiver.type->is_counter()
? assignment_testable::test_result::NOT_ASSIGNABLE
: assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
@@ -673,9 +694,18 @@ untyped_constant_prepare_expression(const untyped_constant& uc, data_dictionary:
return std::nullopt;
}
if (!is_assignable(untyped_constant_test_assignment(uc, db, keyspace, *receiver))) {
if (uc.partial_type != untyped_constant::type_class::null) {
throw exceptions::invalid_request_exception(format("Invalid {} constant ({}) for \"{}\" of type {}",
uc.partial_type, uc.raw_text, *receiver->name, receiver->type->as_cql3_type().to_string()));
} else {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
}
}
if (uc.partial_type == untyped_constant::type_class::null) {
return constant::make_null(receiver->type);
}
raw_value raw_val = cql3::raw_value::make_value(untyped_constant_parsed_value(uc, receiver->type));
return constant(std::move(raw_val), receiver->type);
}
@@ -687,38 +717,19 @@ bind_variable_test_assignment(const bind_variable& bv, data_dictionary::database
}
static
bind_variable
std::optional<bind_variable>
bind_variable_prepare_expression(const bind_variable& bv, data_dictionary::database db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver)
{
if (!receiver) {
return std::nullopt;
}
return bind_variable {
.bind_index = bv.bind_index,
.receiver = receiver
};
}
static
assignment_testable::test_result
null_test_assignment(data_dictionary::database db,
const sstring& keyspace,
const column_specification& receiver) {
return receiver.type->is_counter()
? assignment_testable::test_result::NOT_ASSIGNABLE
: assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
static
std::optional<expression>
null_prepare_expression(data_dictionary::database db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) {
if (!receiver) {
// TODO: It is not possible to infer the type of NULL, but perhaps we can have a matcing null_type that can be cast to anything
return std::nullopt;
}
if (!is_assignable(null_test_assignment(db, keyspace, *receiver))) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
}
return constant::make_null(receiver->type);
}
static
sstring
cast_display_name(const cast& c) {
@@ -864,6 +875,53 @@ test_assignment_function_call(const cql3::expr::function_call& fc, data_dictiona
}
}
std::optional<expression> prepare_conjunction(const conjunction& conj,
data_dictionary::database db,
const sstring& keyspace,
const schema* schema_opt,
lw_shared_ptr<column_specification> receiver) {
if (receiver.get() != nullptr && receiver->type->without_reversed().get_kind() != abstract_type::kind::boolean) {
throw exceptions::invalid_request_exception(
format("AND conjunction produces a boolean value, which doesn't match the type: {} of {}",
receiver->type->name(), receiver->name->text()));
}
lw_shared_ptr<column_specification> child_receiver;
if (receiver.get() != nullptr) {
::shared_ptr<column_identifier> child_receiver_name =
::make_shared<column_identifier>(format("AND_element({})", receiver->name->text()), true);
child_receiver = make_lw_shared<column_specification>(receiver->ks_name, receiver->cf_name,
std::move(child_receiver_name), boolean_type);
} else {
::shared_ptr<column_identifier> child_receiver_name =
::make_shared<column_identifier>("AND_element(unknown)", true);
sstring cf_name = schema_opt ? schema_opt->cf_name() : "unknown_cf";
child_receiver = make_lw_shared<column_specification>(keyspace, std::move(cf_name),
std::move(child_receiver_name), boolean_type);
}
std::vector<expression> prepared_children;
bool all_terminal = true;
for (const expression& child : conj.children) {
std::optional<expression> prepared_child =
try_prepare_expression(child, db, keyspace, schema_opt, child_receiver);
if (!prepared_child.has_value()) {
throw exceptions::invalid_request_exception(fmt::format("Could not infer type of {}", child));
}
if (!is<constant>(*prepared_child)) {
all_terminal = false;
}
prepared_children.push_back(std::move(*prepared_child));
}
conjunction result = conjunction{std::move(prepared_children)};
if (all_terminal) {
return constant(evaluate(result, evaluation_inputs{}), boolean_type);
}
return result;
}
std::optional<expression>
try_prepare_expression(const expression& expr, data_dictionary::database db, const sstring& keyspace, const schema* schema_opt, lw_shared_ptr<column_specification> receiver) {
return expr::visit(overloaded_functor{
@@ -873,8 +931,8 @@ try_prepare_expression(const expression& expr, data_dictionary::database db, con
[&] (const binary_operator&) -> std::optional<expression> {
on_internal_error(expr_logger, "binary_operators are not yet reachable via prepare_expression()");
},
[&] (const conjunction&) -> std::optional<expression> {
on_internal_error(expr_logger, "conjunctions are not yet reachable via prepare_expression()");
[&] (const conjunction& conj) -> std::optional<expression> {
return prepare_conjunction(conj, db, keyspace, schema_opt, receiver);
},
[] (const column_value& cv) -> std::optional<expression> {
return cv;
@@ -945,9 +1003,6 @@ try_prepare_expression(const expression& expr, data_dictionary::database db, con
[&] (const field_selection&) -> std::optional<expression> {
on_internal_error(expr_logger, "field_selections are not yet reachable via prepare_expression()");
},
[&] (const null&) -> std::optional<expression> {
return null_prepare_expression(db, keyspace, receiver);
},
[&] (const bind_variable& bv) -> std::optional<expression> {
return bind_variable_prepare_expression(bv, db, keyspace, receiver);
},
@@ -1009,9 +1064,6 @@ test_assignment(const expression& expr, data_dictionary::database db, const sstr
[&] (const field_selection&) -> test_result {
on_internal_error(expr_logger, "field_selections are not yet reachable via test_assignment()");
},
[&] (const null&) -> test_result {
return null_test_assignment(db, keyspace, receiver);
},
[&] (const bind_variable& bv) -> test_result {
return bind_variable_test_assignment(bv, db, keyspace, receiver);
},
@@ -1138,7 +1190,7 @@ static lw_shared_ptr<column_specification> get_lhs_receiver(const expression& pr
// Given type of LHS and the operation finds the expected type of RHS.
// The type will be the same as LHS for simple operations like =, but it will be different for more complex ones like IN or CONTAINS.
static lw_shared_ptr<column_specification> get_rhs_receiver(lw_shared_ptr<column_specification>& lhs_receiver, oper_t oper) {
const data_type& lhs_type = lhs_receiver->type->underlying_type();
const data_type lhs_type = lhs_receiver->type->underlying_type();
if (oper == oper_t::IN) {
data_type rhs_receiver_type = list_type_impl::get_instance(std::move(lhs_type), false);

View File

@@ -144,7 +144,7 @@ void preliminary_binop_vaidation_checks(const binary_operator& binop) {
}
if (binop.op == oper_t::IS_NOT) {
bool rhs_is_null = is<null>(binop.rhs)
bool rhs_is_null = (is<untyped_constant>(binop.rhs) && as<untyped_constant>(binop.rhs).partial_type == untyped_constant::type_class::null)
|| (is<constant>(binop.rhs) && as<constant>(binop.rhs).is_null());
if (!rhs_is_null) {
throw exceptions::invalid_request_exception(format("Unsupported \"IS NOT\" relation: {}", pretty_binop_printer));

30
cql3/expr/unset.hh Normal file
View File

@@ -0,0 +1,30 @@
// Copyright (C) 2023-present ScyllaDB
// SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
#pragma once
#include <optional>
#include "expression.hh"
namespace cql3 {
class query_options;
}
namespace cql3::expr {
// Some expression users can behave differently if the expression is a bind variable
// and if that bind variable is unset. unset_bind_variable_guard encapsulates the two
// conditions.
class unset_bind_variable_guard {
// Disengaged if the operand is not exactly a single bind variable.
std::optional<bind_variable> _var;
public:
explicit unset_bind_variable_guard(const expr::expression& operand);
explicit unset_bind_variable_guard(std::nullopt_t) {}
explicit unset_bind_variable_guard(const std::optional<expr::expression>& operand);
bool is_unset(const query_options& qo) const;
};
}

View File

@@ -12,7 +12,7 @@
#include "types.hh"
#include "types/tuple.hh"
#include "cql3/functions/scalar_function.hh"
#include "cql_serialization_format.hh"
#include "cql3/util.hh"
#include "utils/big_decimal.hh"
#include "aggregate_fcts.hh"
#include "user_aggregate.hh"
@@ -40,10 +40,10 @@ public:
virtual void reset() override {
_count = 0;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
return long_type->decompose(_count);
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
++_count;
}
virtual void set_accumulator(const opt_bytes& acc) override {
@@ -56,7 +56,7 @@ public:
virtual opt_bytes get_accumulator() const override {
return long_type->decompose(_count);
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc) {
auto other = value_cast<int64_t>(long_type->deserialize(bytes_view(*acc)));
_count += other;
@@ -189,13 +189,13 @@ public:
virtual void reset() override {
_acc = _initcond;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
return _finalfunc ? _finalfunc->execute(sf, std::vector<bytes_opt>{_acc}) : _acc;
virtual opt_bytes compute() override {
return _finalfunc ? _finalfunc->execute(std::vector<bytes_opt>{_acc}) : _acc;
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
std::vector<bytes_opt> args{_acc};
args.insert(args.end(), values.begin(), values.end());
_acc = _sfunc->execute(sf, args);
_acc = _sfunc->execute(args);
}
virtual void set_accumulator(const opt_bytes& acc) override {
_acc = acc;
@@ -203,9 +203,9 @@ public:
virtual opt_bytes get_accumulator() const override {
return _acc;
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
std::vector<bytes_opt> args{_acc, acc};
_acc = _rfunc->execute(sf, args);
_acc = _rfunc->execute(args);
}
};
@@ -218,10 +218,10 @@ public:
virtual void reset() override {
_sum = {};
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
return data_type_for<Type>()->decompose(accumulator_for<Type>::narrow(_sum));
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -237,7 +237,7 @@ public:
virtual opt_bytes get_accumulator() const override {
return accumulator_for<Type>::decompose(_sum);
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc) {
auto other = accumulator_for<Type>::deserialize(acc);
_sum += other;
@@ -248,7 +248,7 @@ public:
template <typename Type>
class impl_reducible_sum_function final : public impl_sum_function_for<Type> {
public:
virtual bytes_opt compute(cql_serialization_format sf) override {
virtual bytes_opt compute() override {
return this->get_accumulator();
}
};
@@ -316,14 +316,14 @@ public:
_sum = {};
_count = 0;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
Type ret{};
if (_count) {
ret = impl_div_for_avg<Type>::div(_sum, _count);
}
return data_type_for<Type>()->decompose(ret);
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -348,7 +348,7 @@ public:
);
return tuple_val.serialize();
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc) {
data_type tuple_type = tuple_type_impl::get_instance({accumulator_for<Type>::data_type(), long_type});
auto tuple = value_cast<tuple_type_impl::native_type>(tuple_type->deserialize(bytes_view(*acc)));
@@ -362,7 +362,7 @@ public:
template <typename Type>
class impl_reducible_avg_function : public impl_avg_function_for<Type> {
public:
virtual bytes_opt compute(cql_serialization_format sf) override {
virtual bytes_opt compute() override {
return this->get_accumulator();
}
};
@@ -457,13 +457,13 @@ public:
virtual void reset() override {
_max = {};
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
if (!_max) {
return {};
}
return data_type_for<Type>()->decompose(data_value(Type{*_max}));
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -487,8 +487,8 @@ public:
}
return {};
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
return add_input(sf, {acc});
virtual void reduce(const opt_bytes& acc) override {
return add_input({acc});
}
};
@@ -502,10 +502,10 @@ public:
virtual void reset() override {
_max = {};
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
return _max.value_or(bytes{});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (values.empty() || !values[0]) {
return;
}
@@ -519,11 +519,11 @@ public:
virtual opt_bytes get_accumulator() const override {
return _max;
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc && !acc->length()) {
return;
}
return add_input(sf, {acc});
return add_input({acc});
}
};
@@ -598,13 +598,13 @@ public:
virtual void reset() override {
_min = {};
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
if (!_min) {
return {};
}
return data_type_for<Type>()->decompose(data_value(Type{*_min}));
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -628,8 +628,8 @@ public:
}
return {};
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
return add_input(sf, {acc});
virtual void reduce(const opt_bytes& acc) override {
return add_input({acc});
}
};
@@ -643,10 +643,10 @@ public:
virtual void reset() override {
_min = {};
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
return _min.value_or(bytes{});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (values.empty() || !values[0]) {
return;
}
@@ -660,11 +660,11 @@ public:
virtual opt_bytes get_accumulator() const override {
return _min;
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc && !acc->length()) {
return;
}
return add_input(sf, {acc});
return add_input({acc});
}
};
@@ -720,10 +720,10 @@ public:
virtual void reset() override {
_count = 0;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
virtual opt_bytes compute() override {
return long_type->decompose(_count);
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
virtual void add_input(const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
@@ -739,7 +739,7 @@ public:
virtual opt_bytes get_accumulator() const override {
return long_type->decompose(_count);
}
virtual void reduce(cql_serialization_format sf, const opt_bytes& acc) override {
virtual void reduce(const opt_bytes& acc) override {
if (acc) {
auto other = value_cast<int64_t>(long_type->deserialize(bytes_view(*acc)));
_count += other;
@@ -814,6 +814,35 @@ bool user_aggregate::is_reducible() const { return _reducefunc != nullptr; }
bool user_aggregate::requires_thread() const { return _sfunc->requires_thread() || (_finalfunc && _finalfunc->requires_thread()); }
bool user_aggregate::has_finalfunc() const { return _finalfunc != nullptr; }
std::ostream& user_aggregate::describe(std::ostream& os) const {
auto ks = cql3::util::maybe_quote(name().keyspace);
auto na = cql3::util::maybe_quote(name().name);
os << "CREATE AGGREGATE " << ks << "." << na << "(";
for (size_t i = 0; i < _arg_types.size(); i++) {
if (i > 0) {
os << ", ";
}
os << _arg_types[i]->cql3_type_name();
}
os << ")\n";
os << "SFUNC " << cql3::util::maybe_quote(_sfunc->name().name) << "\n"
<< "STYPE " << _sfunc->return_type()->cql3_type_name();
if (is_reducible()) {
os << "\n" << "REDUCEFUNC " << cql3::util::maybe_quote(_reducefunc->name().name);
}
if (has_finalfunc()) {
os << "\n" << "FINALFUNC " << cql3::util::maybe_quote(_finalfunc->name().name);
}
if (_initcond) {
os << "\n" << "INITCOND " << _sfunc->return_type()->deserialize(bytes_view(*_initcond)).to_parsable_string();
}
os << ";";
return os;
}
shared_ptr<aggregate_function>
aggregate_fcts::make_count_rows_function() {
return make_shared<count_rows_function>();

View File

@@ -18,7 +18,6 @@
#include "bytes_ostream.hh"
#include "types.hh"
#include "cql_serialization_format.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
@@ -47,7 +46,7 @@ public:
virtual bool requires_thread() const override;
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
virtual bytes_opt execute(const std::vector<bytes_opt>& parameters) override {
bytes_ostream encoded_row;
encoded_row.write("{", 1);
for (size_t i = 0; i < _selector_names.size(); ++i) {

Some files were not shown because too many files have changed in this diff Show More