Commit Graph

7074 Commits

Author SHA1 Message Date
Kefu Chai
30e82a81e8 test: do not define boost_test_print_type() for types with operator<<
before this change, we provide `boost_test_print_type()` for all types
which can be formatted using {fmt}. these types includes those who
fulfill the concept of range, and their element can be formatted using
{fmt}. if the compilation unit happens to include `fmt/ranges.h`.
the ranges are formatted with `boost_test_print_type()` as well. this
is what we expect. in other words, we use {fmt} to format types which
do not natively support {fmt}, but they fulfill the range concept.

but `boost::unit_test::basic_cstring` is one of them

- it can be formatted using operator<<, but it does not provide
  fmt::format specialization
- it fulfills the concept of range
- and its element type is `char const`, which can be formatted using
  {fmt}

that's why it's formatted like:

```
test/boost/sstable_directory_test.cc(317): fatal error: in "sstable_directory_test_generation_sanity": critical check ['s', 's', 't', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')', ' ', '=', '=', ' ', 's', 's', 't', '1', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')'] has failed`
```

where the string is formatted as a sequence-alike container. this
is far from readable.

so, in this change, we do not define `boost_test_print_type()` for
the types which natively support `operator<<` anymore. so they can
be printed with `operator<<` when  boost::test prints them.

Fixes scylladb/scylladb#19637
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19638
2024-07-09 10:34:37 +03:00
Botond Dénes
9544c364be scylla-gdb.py: introduce scylla large-objects
The equivalent of small-objects, but for large objects (spans).
Allows listing object of a large-class, and therefore investigating a
run-away class, by attempting to identify the owners of the objects in
it.

Written to investigate #16493

Closes scylladb/scylladb#16711
2024-07-09 10:21:09 +03:00
Piotr Dulikowski
3c535641fd Merge 'service/storage_proxy: Add metrics keeping track of incoming hints' from Dawid Mędrek
Although Scylla already exposes metrics keeping track of various information related to hinted handoff, all of them correspond to either storing or sending hints. However, when debugging, it's also crucial to be aware of how many hints are coming to a given node and what their size is. Unfortunately, the existing metrics are not enough to obtain that information.

This PR introduces the following new metrics:

* `sent_bytes_total` – the total size of the hints that have been sent from a given shard,
* `received_hints_total` – the total number of hints that a given shard has received,
* `received_hints_bytes_total` – the total size of the hints a given shard has received.

It also renames `hints_manager_sent` to `hints_manager_sent_total` to avoid conflicts of prefixes between that metric and `sent_bytes_total` in tests.

Fixes scylladb/scylladb#10987

Closes scylladb/scylladb#18976

* github.com:scylladb/scylladb:
  db/hints: Add a metric for the size of sent hints
  service/storage_proxy: Add metrics for received hints
2024-07-08 10:29:53 +02:00
Botond Dénes
56c194e52c Merge 'compaction: not include unused headers' from Kefu Chai
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#19581

* github.com:scylladb/scylladb:
  .github: add compaction to iwyu's CLEANER_DIR
  compaction: not include unused headers
2024-07-08 10:03:51 +03:00
Michael Litvak
407274e828 view: drain view builder before database
The view builder is doing write operations to the database.
In order for the view builder to shutdown gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.

Fixes scylladb/scylladb#18929

Closes scylladb/scylladb#19609
2024-07-05 22:17:40 +03:00
Avi Kivity
0626e0487d Merge 'Add copy on write to functions schema code' from Marcin Maliszkiewicz
This is the first patch from series which would allow us to unify raft command code. Property we want to achieve is that all modifications performed by a single raft command can be made visible atomically. This helps to exclude accidental dependencies across subsystem updates and make easier to reason about state.

Here we alter functions schema code so that changes are first applied to a copy of declared functions and then made visible atomically. Later work will apply similar strategy to the whole schema.

Relates scylladb/scylladb#19153

Closes scylladb/scylladb#19598

* github.com:scylladb/scylladb:
  cql3: functions: make modification functions accessible only via batch class
  db: replica: batch functions schema modifications
  cql3: functions: introduce class for batching functions modifications
  cql3: functions: make functions class non-static
  cql3: functions: remove reduntant class access specifiers
  cql3: functions: remove unused java snippet
2024-07-04 17:40:23 +03:00
Nadav Har'El
96dff367f8 Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros
This series is another approach of https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets
updated - we do not assure that the view update backlog returned in a response is necessarily the backlog
that increased due to the corresponding write, the returned backlog may be outdated up to 10ms. Because
 this series does not include this change, it's considerably less complex and it doesn't modify the common
write patch, so no particular performance considerations were needed in that context. The issue being fixed
is still the same, the full description can be seen below.

When a replica applies a write on a table which has a materialized view
it generates view updates. These updates take memory which is tracked
by `database::_view_update_concurrency_sem`, separate on each shard.
The fraction of units taken from the semaphore to the semaphore limit
is the shard's view update backlog. Based on these backlogs, we want
to estimate how busy a node is with its view updates work. We do that
by taking the max backlog across all shards.
To avoid excessive cross-shard operations, the node's (max) backlog isn't
calculated each time we need it, but up to 1 time per 10ms (the `_interval`) with an optimization where the backlog of the calculating shard is immediately up-to-date (we don't need cross-shard operations for it):
```
update_backlog node_update_backlog::fetch() {
    auto now = clock::now();
    if (now >= _last_update.load(std::memory_order_relaxed) + _interval) {
        _last_update.store(now, std::memory_order_relaxed);
        auto new_max = boost::accumulate(
                _backlogs,
                update_backlog::no_backlog(),
                [] (const update_backlog& lhs, const per_shard_backlog& rhs) {
                    return std::max(lhs, rhs.load());
                });
        _max.store(new_max, std::memory_order_relaxed);
        return new_max;
    }
    return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed));
}
```
For the same reason, even when we do calculate the new node's backlog,
we don't read from the `_view_update_concurrency_sem`. Instead, for
each shard we also store a update_backlog atomic which we use for
calculation:
```
    struct per_shard_backlog {
        // Multiply by 2 to defeat the prefetcher
        alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog();
        need_publishing need_publishing = need_publishing::no;

        update_backlog load() const {
            return backlog.load(std::memory_order_relaxed);
        }
    };
 std::vector<per_shard_backlog> _backlogs;
```
Due to this distinction, the update_backlog atomic need to be updated
separately, when the `_view_update_concurrency_sem` changes.
This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`)
and then calls node`_update_backlog::add` where the read backlog
is stored in the atomic:
```
void storage_proxy::update_view_update_backlog() {
    _max_view_update_backlog.add(get_db().local().get_view_update_backlog());
}
void node_update_backlog::add(update_backlog backlog) {
    _backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed);
    _backlogs[this_shard_id()].need_publishing = need_publishing::yes;
}
```
For this implementation of calculating the node's view update backlog to work,
we need the atomics to be updated correctly when the semaphores of corresponding
shards change.

The main event where the view update backlog changes is an incoming write
request. That's why when handling the request and preparing a response
we update the backlog calling `storage_proxy::get_view_update_backlog` (also
because we want to read the backlog and send it in the response):
backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`)
```
 auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
     return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
             .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
         // make mutation alive until it is processed locally, otherwise it
         // may disappear if write timeouts before this future is ready
         got_response(response_id, my_address, get_view_update_backlog());
     });
 };
backlog update after remote view updates (storage_proxy::remote::handle_write)

 auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr,
         shard, response_id, p->get_view_update_backlog()));
```
Now assume that on a certain node we have a write request received on shard A,
which updates a row on shard B (A!=B). As a result, shard B will generate view
updates and consume units from its `_view_update_concurrency_sem`, but will
not update its atomic in `_backlogs` yet. Because both shards in the example
are on the same node, shard A will perform a local write calling `lmutate` shown
above. In the `lmutate` call, the `apply_locally` will initiate the actual write on
shard B and the `storage_proxy::update_view_update_backlog` will be called back
on shard A. In no place will the backlog atomic on shard B get updated even
though it increased in size due to the view updates generated there.
Currently, what we calculate there doesn't really matter - it's only used for the
MV flow control delays, so currently, in this scenario, we may only overload
a replica causing failed replica writes which will be later retried as hints. However,
when we add MV admission control, the calculated backlog will be the difference
between an accepted and a rejected request.

Fixes: https://github.com/scylladb/scylladb/issues/18542

Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none

Closes scylladb/scylladb#19341

* github.com:scylladb/scylladb:
  test: add test for view backlog not being updated on correct shard
  test: move auxiliary methods for waiting until a view is built to util
  mv: update view update backlog when it increases on correct shard
2024-07-04 11:40:09 +03:00
Marcin Maliszkiewicz
16b770ff1a cql3: functions: make functions class non-static
This is done to ease code reuse in the following commit.
It'd also help should we ever want properly mount functions
class to schema object instead of static storage.
2024-07-04 10:24:57 +02:00
Kefu Chai
35e7a0b36f test/cql-pytest: use offset-aware API to avoid deprecate warning
to avoid warning like

```
DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
```

and to be future-proof, let's use the offset-aware timestamp.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19536
2024-07-04 10:48:00 +03:00
Botond Dénes
e3e5f8209d Merge 'alternator: fix "/localnodes" to use broadcast_rpc_address' from Nadav Har'El
This short series fixes Alternator's "/localnodes" request to allow a node's external IP address - configured with `broadcast_rpc_address` - to be listed instead of its usual, internal, IP address.

The first patch fixes a bug in gossiper::get_rpc_address(), which the second patch needs to implement the feature. The second patch also contains regression tests.

Fixes #18711.

Closes scylladb/scylladb#18828

* github.com:scylladb/scylladb:
  alternator: fix "/localnodes" to use broadcast_rpc_address
  gossiper: fix get_rpc_address() for this node
2024-07-04 10:37:28 +03:00
Wojciech Mitros
1fdc65279d test: add test for view backlog not being updated on correct shard
This patch adds a test for reproducing issue https://github.com/scylladb/scylladb/issues/18542
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. The metric differs from the metric
from `database` metric with the same name by taking the backlog
from the max_view_update_backlog which keeps view update backlogs
from all shards which may be a bit outdated, instead of taking
the backlog by checking the view_update_semaphore which the backlog
is based on directly.
2024-07-03 23:18:52 +02:00
Wojciech Mitros
c4f5659c11 test: move auxiliary methods for waiting until a view is built to util
In many materialized view tests we need to wait until a view is built before
actually working on it, future tests will also need it. In existing tests
we use the same, duplicated method for achieving that.
In this patch the method is deduplicated and moved to pylib/util.py
and existing tests are modified to use it instead.
2024-07-03 23:18:52 +02:00
Avi Kivity
3fc4e23a36 forward_service: rename to mapreduce_service
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.

The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.

Closes scylladb/scylladb#19444
2024-07-03 19:29:47 +03:00
Michael Litvak
08b29460fc mv: skip building view updates on a pending replica
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.

This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.

This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.

The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.

The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.

Fixes scylladb/scylladb#19152

Closes scylladb/scylladb#19488
2024-07-02 13:10:18 +02:00
Nadav Har'El
d61513c41c Merge 'reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache.
This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly.

This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.

Fixes: https://github.com/scylladb/scylladb/issues/19017

This fixes a regression introduced in 5.0, so we have to backport to all currently supported releases

Closes scylladb/scylladb#19018

* github.com:scylladb/scylladb:
  test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrenc  Please enter the commit message for your changes. Lines starting
  test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
  reader_concurrency_semaphore: wire in the configurable cpu concurrency
  reader_concurrency_semaphore: add cpu_concurrency constructor parameter
  db/config: introduce reader_concurrency_semahore_cpu_concurrency
2024-07-02 13:39:00 +03:00
Kefu Chai
e87b64b7bb compaction: not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-02 14:06:42 +08:00
Nadav Har'El
44e036c53c alternator: fix "/localnodes" to use broadcast_rpc_address
Alternator's non-standard "/localnodes" HTTP request returns a list of
live nodes on this DC, to consider for load balancing. The returned
node addresses should be external IP addresses usable by the clients.
Scylla has a configuration parameter - broadcast_rpc_address - which
defines for a node an external IP address. If such a configuration
exists, we need to use those external IP addresses, not the internal
ones.

Finding these broadcast_rpc_address of all nodes is easy, because the
gossiper already gossips them.

This patch also tests the new feature:
1. The existing single-node test is extended to verify that without
   broadcast_rpc_address we get the usual IP address.
2. A new two-node test is added to check that when broadcast_rpc_address
   is configured, we get that address and not the usual internal IP
   addresses.

Fixes #18711.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-06-30 18:38:15 +03:00
Gleb Natapov
3f136cf2eb test: add test that checks that local address cannot expire between join request placemen and its processing 2024-06-30 15:52:23 +03:00
Pavel Emelyanov
d034cde01f Merge 'build: update C++ standard to C++23' from Avi Kivity
Switch the C++ standard from C++20 to C++23. This is straightforward, but there are a few
fallouts (mostly due to std::unique_ptr that became constexpr) that need to be fixed first.

Internal enhancement - no backport required

Closes scylladb/scylladb#19528

* github.com:scylladb/scylladb:
  build: switch to C++23
  config: avoid binding an lvalue reference to an rvalue reference
  readers: define query::partition_slice before using it in default argument
  test: define table_for_tests earlier
  compaction: define compaction_group::table_state earlier
  compaction: compaction_group: define destructor out-of-line
  compaction_manager: define compaction_manager::strategy_control earlier
2024-06-28 18:02:33 +03:00
Piotr Dulikowski
f00c4eaf72 Merge '[test.py] add --extra-scylla-cmdline-options argument for test.py' from Artsiom Mishuta
this PR has 2 commits
- [test: pass Scylla extra CMD args from test.py args](6b367a04b5)
- [test: adjust scylla_cluster.merge_cmdline_options behavior](c60b36090a)

the main goal is to solve [test.py: provide an easy-to-remember, univeral way to run scylla with trace level logging](https://github.com/scylladb/scylladb/issues/14960) issue

but also can be used to easily apply additional arguments for all UnitTests and PythonTests on the fly from the test.py CMD

Closes scylladb/scylladb#19509

* github.com:scylladb/scylladb:
  test: adjust scylla_cluster.merge_cmdline_options behavior
  test: pass scylla extra CMD args from test.py args
2024-06-28 11:11:29 +02:00
Piotr Smaron
88eda47f13 cql: forbid switching from tablets to vnodes in ALTER KS
This check is already in place, but isn't fully working, i.e.
switching from a vnode KS to a tablets KS is not allowed, but
this check doesn't work in the other direction. To fix the
latter, `ks_prop_defs::get_initial_tablets()` has been changed
to handle 3 states: (1) init_tablets is set, (2) it was skipped,
(3) tablets are disabled. These couldn't fit into std::optional,
so a new local struct to hold these states has been introduced.
Callers of this function have been adjusted to set init_tablets
to an appropriate value according to the circumstances, i.e. if
tablets are globally enabled, but have been skipped in the CQL,
init_tablets is automatically set to 0, but if someone executes
ALTER KS and doesn't provide tablets options, they're inherited
from the old KS.
I tried various approaches and this one resulted in the least
lines of code changed. I also provided testcases to explain how
the code behaves.

Fixes: #18795

Closes scylladb/scylladb#19368
2024-06-28 11:41:41 +03:00
Piotr Dulikowski
f9abe52d3b Merge 'test: auth: add random tag to resources in test_auth_v2_migration' from Marcin Maliszkiewicz
Those tests are sometimes failing on CI and we have two hypothesis:
1. Something wrong with consistency of statements
2. Interruption from another test run (e.g. same queries performed concurrently or data remained after previous run)

To exclude or confirm 2. we add random marker to avoid potential collision, in such case it will be clearly visible that wrong data comes from a different run.

Related scylladb/scylladb#18931
Related scylladb/scylladb#18319

backport: no, just a test fix

Closes scylladb/scylladb#19484

* github.com:scylladb/scylladb:
  test: auth: add random tag to resources in test_auth_v2_migration
  test: extend unique_name with random sufix
2024-06-27 17:35:14 +02:00
Avi Kivity
e5807555bd test: define table_for_tests earlier
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.

One victim of this change is table_for_tests; define it earlier to
build with C++23.
2024-06-27 17:54:12 +03:00
Botond Dénes
b4f3809ad2 test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrenc
Please enter the commit message for your changes. Lines starting
2024-06-27 09:57:11 -04:00
Botond Dénes
9cbdd8ef92 test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).
2024-06-27 09:57:11 -04:00
Botond Dénes
59faa6d4ff reader_concurrency_semaphore: add cpu_concurrency constructor parameter
In the case of the user semaphore, this receives the new
reader_concurrency_semaphore_cpu_limit config item.
Not used yet.
2024-06-27 09:57:11 -04:00
Artsiom Mishuta
919d44e0c7 test: adjust scylla_cluster.merge_cmdline_options behavior
adjust merge_cmdline_options behaviour to
append --logger-log-level option instead of merge

this behaviour can be changed(if needed)
to previour version(all merge):
merge_cmdline_options(list1, list2, appending_options=[])

or, to append different cmd options:
merge_cmdline_options(list1, list2, appending_options=[option1,option2])
2024-06-27 10:03:31 +02:00
Artsiom Mishuta
677173bf8b test: generate core dumps on crashes in nodetool tests
The nodetool tests does not set the asan/ubsan options
to abort on error and create core dumps

Fix by setting the environment variables in nodetool tests.

Closes scylladb/scylladb#19503
2024-06-27 10:44:33 +03:00
Marcin Maliszkiewicz
b708c5701f test: auth: add random tag to resources in test_auth_v2_migration
Those tests are sometimes failing on CI and we have two hypothesis:
1. Something wrong with consistency of statements
2. Interruption from another test run (e.g. same queries performed
  concurrently or data remained after previous run)

To exclude or confirm 2. we add random marker to avoid potential collision,
in such case it will be clearly visible that wrong data comes from
a different run.

Related scylladb/scylladb#18931
Related scylladb/scylladb#18319
2024-06-27 09:28:27 +02:00
Marcin Maliszkiewicz
d08a80b34f test: extend unique_name with random sufix
This reduces collision risk in an unlikely
and incorrect setup where tests would be
run concurrently by multiple processes.
2024-06-27 09:28:02 +02:00
Botond Dénes
2fe50cda22 Merge 'chunked_vector enhancements' from Benny Halevy
This short series enhances utils::chunked_vector so it could be used more easily to convert dht::partition_range_vector to chunked_vector, for example.

- utils: chunked_vector: document invalidation of iterators on move
- utils: chunked_vector: add ctor from std::initializer_list
- utils: chunked_vector: add ctor from a single value

No backport required

Closes scylladb/scylladb#19462

* github.com:scylladb/scylladb:
  chunked_vector_test: add tests for value-initialization constructor
  utils: chunked_vector: add ctor from std::initializer_list
  utils: chunked_vector: document invalidation of iterators on move
2024-06-27 10:20:47 +03:00
Kamil Braun
13fc2bd854 Merge notify other nodes on boot from Gleb
The series adds a step during node's boot process, just before completing
the initialization, in which the node sends a notification to all other
normal nodes in the cluster that it is UP now. Other nodes wait for this
node to be UP and in normal state before replying. This ensures that,
in a healthy cluster, when a node start serving queries the entire
cluster knows its up-to-date state. The notification is a best effort
though. If some nodes are down or do not reply in time the boot process
continues. It is somewhat similar to shutdown notification in this regard.

* 'gleb/notify-up-v2' of github.com:scylladb/scylla-dev:
  gossiper: wait for a bootstrapping node to be seen as normal on all nodes before completing initialization
  Wait for booting node to be marked UP before complete booting.
  gossiper: move gossip verbs to the idl
2024-06-25 17:58:17 +02:00
Avi Kivity
371e37924f Merge 'Rebuild bloom filters that have bad partition estimates' from Lakshmi Narayanan Sreethar
The bloom filters are built with partition estimates because the actual
partition count might not be available in all cases. If the estimate is
inaccurate, the bloom filters might end up being too large or too small
compared to their optimal sizes. This PR rebuilds bloom filters with
inaccurate partition estimates using the actual partition count before
the filter is written to disk. A bloom filter is considered to have an
inaccurate estimate if its false positive rate based on the current
bitmap size is either less than 75% or more than 125% of the configured
false positive rate.

Fixes #19049

A manual test was run to check the impact of rebuild on compaction.

Table definition used : CREATE TABLE scylla_bench.simple_table (id int PRIMARY KEY);

Setup : 3 billion random rows with id in the range [0, 1e8) were inserted as batches of 5 rows into scylla_bench.simple_table via 80 threads.

Compaction statistics :

scylla_bench.simple_table :
(a) Total number of compactions : `1501`
(b) Total time spent in compaction : `9h58m47.269s`
(c) Number of compactions which rebuilt bloom filters : `16`
(d) Total time taken by these 16 compactions which rebuilt bloom filters : `2h55m11.89s`
(e) Total time spent by these 16 compactions to rebuild bloom filters : `8m6.221s` which is
- `4.63%` of the total time taken by the compactions which rebuilt filters (d)
- `1.35%` of the total compaction time (b).

(f) Total bytes saved by rebuilding filters : `388 MB`

system.compaction_history :
(a) Total number of compactions : `77`
(b) Total time spent in compaction : `21.24s`
(c) Number of compactions which rebuilt bloom filters : `74`
(d) Time taken by these 74 compactions which rebuilt bloom filters : `20.48s`
(e) Time spent by these 74 compactions to rebuild bloom filters : `377ms` which is
- `1.84%` of the total time taken by the compactions which rebuilt filters (d)
- `1.77%` of the total compaction time (b).

(f) Total bytes saved by rebuilding filters : `20 kB`

The following tables also had compactions and the bloom filter was rebuilt in all those compactions.
However, the time taken for every rebuild was observed as 0ms from the logs as it completed within a microsecond :

system.raft :
(a) Total number of compactions : `2`
(b) Total time spent in compaction : `106ms`
(c) Total bytes saved by rebuilding filters : `960 B`

system_schema.tables :
(a) Total number of compactions : `1`
(b) Total time spent in compaction : `25ms`
(c) Total bytes saved by rebuilding filter : `312 B`

system.topology :
(a) Total number of compactions : `1`
(b) Total time spent in compaction : `25ms`
(c) Total bytes saved by rebuilding filter : `320 B`

Closes scylladb/scylladb#19190

* github.com:scylladb/scylladb:
  bloom_filter_test: add testcase to verify filter rebuilds
  test/boost: move bloom filter tests from sstable_datafile_test into a new file
  sstables/mx/writer: rebuild bloom filters with bad partition estimates
  sstables/mx/writer: add variable to track number of partitions consumed
  sstable: introduce sstable::maybe_rebuild_filter_from_index()
  sstable: add method to return filter format for the given sstable version
  utils/i_filter: introduce get_filter_size()
2024-06-25 15:35:09 +03:00
Benny Halevy
3f23016cc0 perf-simple-query: add mean and standard deviation stats
Currently, perf-simple-query summarizes the statistics
only for the throughput, printing the median,
median absolute deviation, minimum, and maximum.
But the throughput put is typically highly variable
and its median is noisy.

This patch calculates also the mean and standard deviation
and does that also for instructions_per_op and cpu_cycles_per_op
to present a fuller picture of the performance metrics.

Output example:
```
random-seed=3383668492
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
95613.97 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42456 insns/op,   22117 cycles/op,        0 errors)
97538.45 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42454 insns/op,   22094 cycles/op,        0 errors)
95883.37 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42438 insns/op,   22268 cycles/op,        0 errors)
96791.45 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42433 insns/op,   22256 cycles/op,        0 errors)
97894.71 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   42420 insns/op,   22010 cycles/op,        0 errors)

         throughput: mean=96744.39 standard-deviation=996.89 median=96791.45 median-absolute-deviation=861.02 maximum=97894.71 minimum=95613.97
instructions_per_op: mean=42440.08 standard-deviation=14.99 median=42437.59 median-absolute-deviation=13.58 maximum=42456.15 minimum=42420.10
  cpu_cycles_per_op: mean=22148.98 standard-deviation=110.43 median=22117.04 median-absolute-deviation=106.89 maximum=22267.70 minimum=22010.42
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#19450
2024-06-25 12:25:59 +03:00
Kefu Chai
adca415245 bytes: drop unused operator<<
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.

the callers in alternator/streams.cc is updated to use `fmt::print()`
to format the `bytes` instances.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19448
2024-06-25 12:11:28 +03:00
Benny Halevy
378578b481 chunked_vector_test: add tests for value-initialization constructor
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-06-25 12:08:11 +03:00
Benny Halevy
5bd2ee7507 utils: chunked_vector: add ctor from std::initializer_list
Prepare for using utils::chunked_vector for
dht::partition_range_vector

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-06-25 12:08:06 +03:00
Kefu Chai
af2b0b030b test/pylib: use raw string to avoid using escape sequence
before this change, when running test like:
```console
./test.py --mode release topology_experimental_raft/test_tablets
/home/kefu/dev/scylladb/test/pylib/scylla_cluster.py:333: SyntaxWarning: invalid escape sequence '\('
  deleted_sstable_re = f"^.*/{keyspace}/{table}-[0-9a-f]{{32}}/.* \(deleted\)$"
```
we could have the warning above. because `\(` is not a valid escape
sequence, but the Python interpreter accepts it as two separated
characters of `\(` after complaining. but it's still annoying.

so, let's use a raw string here, as we want to match "(deleted)".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19451
2024-06-24 11:11:44 +03:00
Lakshmi Narayanan Sreethar
a09556a49f bloom_filter_test: add testcase to verify filter rebuilds
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:11:37 +05:30
Lakshmi Narayanan Sreethar
4aa5698f0d test/boost: move bloom filter tests from sstable_datafile_test into a new file
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:06:02 +05:30
Marcin Maliszkiewicz
794440eb85 test: skip checking default role in test_auth_v2_migration
Default role creation in auth-v1 is asynchronous and all nodes race to
create it so we'd need to delay the test and wait. Checking this particular
role doesn't bring much value to the test as we check other roles
to demonstrate correctness.

Fixes scylladb/scylladb#19039

Closes scylladb/scylladb#19424
2024-06-23 19:50:55 +03:00
Kefu Chai
a7e38ada8e test: remove unused operator<<
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19432
2024-06-23 18:02:52 +03:00
zhouxiang
694014591a test/alternator/test_projection_expression.py: remove useless comparisons
pytest.raises expects a block of code that will raise an exception, not
a comparison of results.

Closes scylladb/scylladb#19436
2024-06-23 13:53:14 +03:00
Nadav Har'El
81a02f06dd test/cql-pytest: add more tests for SELECT's LIMIT
SELECT's "LIMIT" feature is tested in combination with other features
in different test/cql-pytest/*.py source files - for examples the
combination of LIMIT and GROUP BY is tested in test_group_by.py.

This patch adds a new test file, test_limit.py, for testing aspects
basic usage of LIMIT that weren't already tested in other files.
The new file also has a comment saying where we have other tests
for LIMIT combined with other features.

All the new tests pass (on both Scylla and Cassandra). But they can
be useful as regression tests to test patches which modify the
behavior of LIMIT - e.g., pull reques #18842.

This patch also adds another test in test_group_by.py. This adds to
one of the tests for the combination of LIMIT and GROUP BY (in this
case, GROUP BY of clustering prefix, no aggregation) also a check
for paging, that was previously missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19392
2024-06-21 19:35:15 +03:00
Kefu Chai
7b10cc8079 treewide: include seastar headers with brackets
this change was created in the same spirit of ebff5f5d.

despite that we include Seastar as a submodule, Seastar is not a
part of scylla project. so we'd better include its headers using
brackets.

ebff5f5d addressed this cosmetic issue a while back. but probably
clangd's header-insertion helped some of contributor to insert
the missing headers with `"`. so this style of `include` returned
to the tree with these new changes.

unfortunately, clangd does not allow us to configure the style
of `include` at the time of writing.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19406
2024-06-21 19:20:27 +03:00
Kefu Chai
987fd59f21 test: correct some misspellings
fix a typo in source code. this typo was identified by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19412
2024-06-21 19:16:11 +03:00
Kefu Chai
c429a8d8ae sstables: use "me" sstable format by default
in 7952200c, we changed the `selected_format` from `mc` to `me`,
but to be backward compatible the cluster starts with "md", so
when the nodes in cluster agree on the "ME_SSTABLE_FORMAT" feature,
the format selector believes that the node is already using "ME",
which is specified by `_selected_format`. even it is actually still
using "md", which is specified by `sstable_manager::_format`, as
changed by 54d49c04. as explained above, it was specified to "md"
in hope to be backward compatible when upgrading from an existign
installation which might be still using "md". but after a second
thought, since we are able to read sstables persisted with older
formats, this concern is not valid.

in other words, 7952200c introduced a regression which changed the
"default" sstable format from `me` to `md`.

to address this, we just change `sstable_manager::_format` to "me",
so that all sstables are created using "me" format.

a test is added accordingly.

Fixes #18995
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19293
2024-06-21 12:56:01 +03:00
Piotr Dulikowski
c3536015e4 Merge 'cql3/statement/select_statement: do not parallelize single-partition aggregations' from Michał Jadwiszczak
This patch adds a check if aggregation query is doing single-partition read and if so, makes the query to not use forward_service and do not parallelize the request.

Fixes scylladb/scylladb#19349

Closes scylladb/scylladb#19350

* github.com:scylladb/scylladb:
  test/boost/cql_query_test: add test for single-partition aggregation
  cql3/select_statement: do not parallelize single-partition aggregations
2024-06-21 08:50:00 +02:00
Avi Kivity
fdc1449392 treewide: rename flat_mutation_reader_v2 to mutation_reader
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:

  e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
  08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"

as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.

The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit

  026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"

In turn, flat_mutation_reader was introduced in 2017 in commit

  748205ca75 "Introduce flat_mutation_reader"

To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.

Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.

Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.

Some notes about the transition:

 - files were also renamed. In one case (flat_mutation_reader_test.cc), the
   rename target already existed, so we rename to
    mutation_reader_another_test.cc.

 - a namespace 'mutation_reader' with two definitions existed (in
   mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
   class. As a result, a few #includes had to be adjusted.

Closes scylladb/scylladb#19356
2024-06-21 07:12:06 +03:00
Avi Kivity
185338c8cf Merge 'Reduce TWCS off-strategy space overhead' from Raphael "Raph" Carvalho
Normally, the space overhead for TWCS is 1/N, where is number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.

Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.

That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.

Fixes #16514.

Closes scylladb/scylladb#18137

* github.com:scylladb/scylladb:
  compaction: Reduce twcs off-strategy space overhead to 10% of free space
  compaction: wire storage free space into reshape procedure
  sstables: Allow to get free space from underlying storage
  replica: don't expose compaction_group to reshape task
2024-06-20 18:51:25 +03:00