scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 11:36:54 +00:00

Files

Nadav Har'El 96dff367f8 Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros

This series is another approach of https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets
updated - we do not assure that the view update backlog returned in a response is necessarily the backlog
that increased due to the corresponding write, the returned backlog may be outdated up to 10ms. Because
 this series does not include this change, it's considerably less complex and it doesn't modify the common
write patch, so no particular performance considerations were needed in that context. The issue being fixed
is still the same, the full description can be seen below.

When a replica applies a write on a table which has a materialized view
it generates view updates. These updates take memory which is tracked
by `database::_view_update_concurrency_sem`, separate on each shard.
The fraction of units taken from the semaphore to the semaphore limit
is the shard's view update backlog. Based on these backlogs, we want
to estimate how busy a node is with its view updates work. We do that
by taking the max backlog across all shards.
To avoid excessive cross-shard operations, the node's (max) backlog isn't
calculated each time we need it, but up to 1 time per 10ms (the `_interval`) with an optimization where the backlog of the calculating shard is immediately up-to-date (we don't need cross-shard operations for it):
```
update_backlog node_update_backlog::fetch() {
    auto now = clock::now();
    if (now >= _last_update.load(std::memory_order_relaxed) + _interval) {
        _last_update.store(now, std::memory_order_relaxed);
        auto new_max = boost::accumulate(
                _backlogs,
                update_backlog::no_backlog(),
                [] (const update_backlog& lhs, const per_shard_backlog& rhs) {
                    return std::max(lhs, rhs.load());
                });
        _max.store(new_max, std::memory_order_relaxed);
        return new_max;
    }
    return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed));
}
```
For the same reason, even when we do calculate the new node's backlog,
we don't read from the `_view_update_concurrency_sem`. Instead, for
each shard we also store a update_backlog atomic which we use for
calculation:
```
    struct per_shard_backlog {
        // Multiply by 2 to defeat the prefetcher
        alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog();
        need_publishing need_publishing = need_publishing::no;

        update_backlog load() const {
            return backlog.load(std::memory_order_relaxed);
        }
    };
 std::vector<per_shard_backlog> _backlogs;
```
Due to this distinction, the update_backlog atomic need to be updated
separately, when the `_view_update_concurrency_sem` changes.
This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`)
and then calls node`_update_backlog::add` where the read backlog
is stored in the atomic:
```
void storage_proxy::update_view_update_backlog() {
    _max_view_update_backlog.add(get_db().local().get_view_update_backlog());
}
void node_update_backlog::add(update_backlog backlog) {
    _backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed);
    _backlogs[this_shard_id()].need_publishing = need_publishing::yes;
}
```
For this implementation of calculating the node's view update backlog to work,
we need the atomics to be updated correctly when the semaphores of corresponding
shards change.

The main event where the view update backlog changes is an incoming write
request. That's why when handling the request and preparing a response
we update the backlog calling `storage_proxy::get_view_update_backlog` (also
because we want to read the backlog and send it in the response):
backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`)
```
 auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
     return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
             .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
         // make mutation alive until it is processed locally, otherwise it
         // may disappear if write timeouts before this future is ready
         got_response(response_id, my_address, get_view_update_backlog());
     });
 };
backlog update after remote view updates (storage_proxy::remote::handle_write)

 auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr,
         shard, response_id, p->get_view_update_backlog()));
```
Now assume that on a certain node we have a write request received on shard A,
which updates a row on shard B (A!=B). As a result, shard B will generate view
updates and consume units from its `_view_update_concurrency_sem`, but will
not update its atomic in `_backlogs` yet. Because both shards in the example
are on the same node, shard A will perform a local write calling `lmutate` shown
above. In the `lmutate` call, the `apply_locally` will initiate the actual write on
shard B and the `storage_proxy::update_view_update_backlog` will be called back
on shard A. In no place will the backlog atomic on shard B get updated even
though it increased in size due to the view updates generated there.
Currently, what we calculate there doesn't really matter - it's only used for the
MV flow control delays, so currently, in this scenario, we may only overload
a replica causing failed replica writes which will be later retried as hints. However,
when we add MV admission control, the calculated backlog will be the difference
between an accepted and a rejected request.

Fixes: https://github.com/scylladb/scylladb/issues/18542

Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none

Closes scylladb/scylladb#19341

* github.com:scylladb/scylladb:
  test: add test for view backlog not being updated on correct shard
  test: move auxiliary methods for waiting until a view is built to util
  mv: update view update backlog when it increases on correct shard

2024-07-04 11:40:09 +03:00

alternator

alternator: fix "/localnodes" to use broadcast_rpc_address

2024-06-30 18:38:15 +03:00

auth_cluster

test: auth: add random tag to resources in test_auth_v2_migration

2024-06-27 09:28:27 +02:00

boost

Merge 'reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes

2024-07-02 13:39:00 +03:00

broadcast_tables

…

cql

…

cql-pytest

test/cql-pytest: use offset-aware API to avoid deprecate warning

2024-07-04 10:48:00 +03:00

lib

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

manual

Merge notify other nodes on boot from Gleb

2024-06-25 17:58:17 +02:00

nodetool

test: generate core dumps on crashes in nodetool tests

2024-06-27 10:44:33 +03:00

object_store

[test.py] Fix URl for failed logs directory in CI

2024-06-17 11:16:48 +02:00

perf

perf-simple-query: add mean and standard deviation stats

2024-06-25 12:25:59 +03:00

pylib

Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros

2024-07-04 11:40:09 +03:00

pylib_test

test.py: support code coverage

2024-01-18 11:11:34 +02:00

raft

test: randomized_nemesis_test: use BOOST_REQUIRE_* when appropriate

2024-06-14 15:33:07 +03:00

redis

…

resource

build: cmake: mark wasm "ALL"

2024-06-20 18:45:31 +03:00

rest_api

test: add test for abort while a task is being unregistered

2024-06-18 13:41:51 +02:00

scylla_gdb

[test.py] Refactor alternator, nodetool, rest_api

2024-06-13 13:56:10 +02:00

topology

test: correct some misspellings

2024-06-21 19:16:11 +03:00

topology_custom

Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros

2024-07-04 11:40:09 +03:00

topology_experimental_raft

alternator: fix "/localnodes" to use broadcast_rpc_address

2024-06-30 18:38:15 +03:00

unit

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

__init__.py

…

CMakeLists.txt

build: cmake: add "unit_test_list" target

2024-01-10 08:43:04 +02:00

README.md

…

README.md

Scylla in-source tests.

For details on how to run the tests, see docs/dev/testing.md

Shared C++ utils, libraries are in lib/, for Python - pylib/

alternator - Python tests which connect to a single server and use the DynamoDB API unit, boost, raft - unit tests in C++ cql-pytest - Python tests which connect to a single server and use CQL topology* - tests that set up clusters and add/remove nodes cql - approval tests that use CQL and pre-recorded output rest_api - tests for Scylla REST API Port 9000 scylla-gdb - tests for scylla-gdb.py helper script nodetool - tests for C++ implementation of nodetool

If you can use an existing folder, consider adding your test to it. New folders should be used for new large categories/subsystems, or when the test environment is significantly different from some existing suite, e.g. you plan to start scylladb with different configuration, and you intend to add many tests and would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).

To add a new folder, create a new directory, and then copy & edit its suite.ini.