scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 13:45:53 +00:00

Author	SHA1	Message	Date
Benny Halevy	71d90b2fbc	view: check_needs_view_update_path: get token_metadata_ptr check_needs_view_update_path is async and might yield so the token_metadata reference passed to it must be kept alive throughout the call. Fixes scylladb/scylladb#20979 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `d34878e96c`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21039	2024-10-22 09:16:40 +03:00
Calle Wilund	4a1e83d6be	commitlog: Fix buffer_list_bytes not updated correctly Fixes #20862 With the change in `60af2f3cb2` the bookkeep for buffer memory was changed subtly, the problem here that we would shrink buffer size before we after flush use said buffer's size to decrement the buffer_list_bytes value, previously inc:ed by the full, allocated size. I.e. we would slowly grow this value instead of adjusting properly to actual used bytes. Test included. (cherry picked from commit `ee5e71172f`) Closes scylladb/scylladb#20914	2024-10-03 09:11:40 +03:00
Avi Kivity	8ddfd0d70d	cql3: add option to not unify bind variables with the same name Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping. In `19a6e69001` we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to. However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection. Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables. A unit test is added. Fixes #15559 (cherry picked from commit `ea8441dfa3`) (cherry picked from commit `edb3068ecf`)	2024-09-13 18:17:15 +03:00
Avi Kivity	92dd47c6d6	cql3: introduce dialect infrastructure A dialect is a different way to interpret the same CQL statement. Examples: - how duplicate bind variable names are handled (later in this series) - whether `column = NULL` in LWT can return true (as is now) or whether it always returns NULL (as in SQL) Currently, dialect is an empty structure and will be filled in later. It is passed to query_processor methods that also accept a CQL string, and from there to the parser. It is part of the prepared statement cache key, so that if the dialect is changed online, previous parses of the statement are ignored and the statement is prepared again. The patch is careful to pick up the dialect at the entry point (e.g. CQL protocol server) so that the dialect doesn't change while a statement is parsed, prepared, and cached. (cherry picked from commit `d69bf4f010`)	2024-09-13 18:11:11 +03:00
Nadav Har'El	d9ba5423bb	Merge 'config: round-trip boolean configuration variables' from Avi Kivity When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in both directions. Not a regression, so a backport isn't strictly necessary. Closes scylladb/scylladb#19792 * github.com:scylladb/scylladb: config: specialize from-string conversion for bool config: wrap boost::lexical_cast<> when converting from strings (cherry picked from commit `9eb47b3ef0`)	2024-09-13 17:54:37 +03:00
Piotr Dulikowski	2556c7a0dc	hints: send hints with CL=ALL if target is leaving Currently, when attempting to send a hint, we might choose its recipients in one of two ways: - If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other, - Otherwise, we send the hint to all current replicas of the mutation. There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded. As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destiantion of the hint is leaving. Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again. Fixes scylladb/scylla-dtest#4582 Refs scylladb/scylladb#19835 (cherry picked from commit `61ac0a336d`)	2024-09-12 10:55:29 +02:00
Piotr Dulikowski	132d77f447	hints: inline do_send_one_mutation It's a small method and it is only used once in send_one_mutation. Inlining it lets us get rid of its declaration in the header - now, if one needs to change the variables passed from one function to another, it is no longer necessary to change the header. (cherry picked from commit `8abb06ab82`)	2024-09-12 10:55:21 +02:00
Gleb Natapov	bb9249f055	db/consistency_level: do not use result from hit weighted load balancer if it contains duplicates Because of https://github.com/scylladb/scylladb/issues/9285 hit weighted load balancer may sometimes return same node twice. It may cause wrong data to be read or unexpected errors to be returned to a client. Since the original bug is not easy to fix and it is rare lets introduce a workaround. We will check for duplicates and will use non HWLB one if one is found. (cherry picked from commit `e06a772b87`) Closes scylladb/scylladb#20468	2024-09-10 17:18:47 +03:00
Benny Halevy	31f3ff37f4	schema_tables: calculate_schema_digest: filter the key earlier Currently, each frozen mutation we get from system_keyspace::query_mutations is unfrozen in whole to a mutation and only then we check its key with the provided `accept_keyspace` function. This is wasteful, since they key can be processed directly form the frozen mutation, before taking the toll of unfreezing it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `52234214e5`)	2024-08-22 09:06:26 +00:00
Benny Halevy	828595786a	schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector With a large number of table the schema mutations vector might get big enoug to cause reactor stalls when freed. For example, the following stall was hit on 2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables: ``` (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799 ``` This change returns a mutations generator from the `map` lambda coroutine so we can process them one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls. Fixes #18173 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `95a5fba0ea`)	2024-08-22 09:06:25 +00:00
Dawid Medrek	8d90b81766	db/hints: Make commitlog use commitlog IO scheduling group Before these changes, we didn't specify which I/O scheduling group commitlog instances in hinted handoff should use. In this commit, we set it explicitly to the commitlog scheduling group. The rationale for this choice is the fact we don't want to cause a bottleneck on the write path -- if hints are written too slowly, new incoming mutations (NOT hints) might be rejected due to a too high number of hints currently being written to disk; see `storage_proxy::create_write_response_handler_helper()` for more context. (cherry picked from commit `6a7fb18b52`) Closes scylladb/scylladb#20093	2024-08-14 22:14:43 +03:00
Piotr Dulikowski	30b0cb4f5d	db/view: drop view updates to replaced node marked as left When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address. This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0. As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas. Fixes: scylladb/scylladb#19439 (cherry picked from commit `6af7882c59`)	2024-07-26 14:02:50 +00:00
Botond Dénes	53a6ec05ed	Merge 'replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Fixes #18821. ``` WRITE ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets --write - BEFORE 65559.52 tps ( 59.6 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52841 insns/op, 30946 cycles/op, 0 errors) 67408.05 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53018 insns/op, 30874 cycles/op, 0 errors) 67714.72 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53026 insns/op, 30881 cycles/op, 0 errors) 67825.57 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53015 insns/op, 30821 cycles/op, 0 errors) 67810.74 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53009 insns/op, 30828 cycles/op, 0 errors) throughput: mean=67263.72 standard-deviation=967.40 median=67714.72 median-absolute-deviation=547.02 maximum=67825.57 minimum=65559.52 instructions_per_op: mean=52981.61 standard-deviation=79.09 median=53014.96 median-absolute-deviation=36.54 maximum=53025.79 minimum=52840.56 cpu_cycles_per_op: mean=30869.90 standard-deviation=50.23 median=30874.06 median-absolute-deviation=42.11 maximum=30945.94 minimum=30820.89 - AFTER 65448.76 tps ( 59.5 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52788 insns/op, 31013 cycles/op, 0 errors) 67290.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30950 cycles/op, 0 errors) 67646.81 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30909 cycles/op, 0 errors) 67565.90 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53058 insns/op, 30951 cycles/op, 0 errors) 67537.32 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 52983 insns/op, 30963 cycles/op, 0 errors) throughput: mean=67097.93 standard-deviation=931.44 median=67537.32 median-absolute-deviation=467.97 maximum=67646.81 minimum=65448.76 instructions_per_op: mean=52975.85 standard-deviation=108.07 median=53024.55 median-absolute-deviation=49.45 maximum=53057.99 minimum=52788.49 cpu_cycles_per_op: mean=30957.17 standard-deviation=37.43 median=30951.31 median-absolute-deviation=7.51 maximum=31013.01 minimum=30908.62 READ ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets - BEFORE 79423.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41840 insns/op, 26820 cycles/op, 0 errors) 81076.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41837 insns/op, 26583 cycles/op, 0 errors) 80927.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41829 insns/op, 26629 cycles/op, 0 errors) 80539.44 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41841 insns/op, 26735 cycles/op, 0 errors) 80793.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41864 insns/op, 26662 cycles/op, 0 errors) throughput: mean=80551.99 standard-deviation=661.12 median=80793.10 median-absolute-deviation=375.37 maximum=81076.70 minimum=79423.36 instructions_per_op: mean=41842.20 standard-deviation=13.26 median=41840.14 median-absolute-deviation=5.68 maximum=41864.50 minimum=41829.29 cpu_cycles_per_op: mean=26685.88 standard-deviation=93.31 median=26662.18 median-absolute-deviation=56.47 maximum=26820.08 minimum=26582.68 - AFTER 79464.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41799 insns/op, 26761 cycles/op, 0 errors) 80954.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41803 insns/op, 26605 cycles/op, 0 errors) 81160.90 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41811 insns/op, 26555 cycles/op, 0 errors) 81263.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41814 insns/op, 26527 cycles/op, 0 errors) 81162.97 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41806 insns/op, 26549 cycles/op, 0 errors) throughput: mean=80801.25 standard-deviation=755.54 median=81160.90 median-absolute-deviation=361.72 maximum=81263.10 minimum=79464.70 instructions_per_op: mean=41806.47 standard-deviation=5.85 median=41806.05 median-absolute-deviation=4.05 maximum=41813.86 minimum=41799.36 cpu_cycles_per_op: mean=26599.22 standard-deviation=94.84 median=26554.54 median-absolute-deviation=50.51 maximum=26761.06 minimum=26527.05 ``` Closes scylladb/scylladb#19469 * github.com:scylladb/scylladb: replica: remove rwlock for protecting iteration over storage group map replica: get rid of fragile compaction group intrusive list	2024-07-12 15:45:36 +03:00
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Avi Kivity	f31d5e3204	Merge 'repair/streaming: enable toggling tombstone gc with a config item' from Botond Dénes We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing. This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows. This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. This new config item defaults to `false` (current behaviour), setting it to `true`, will enable tombstone GC. Fixes: https://github.com/scylladb/scylladb/issues/19015 Not a regression, no backport needed Closes scylladb/scylladb#19016 * github.com:scylladb/scylladb: test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming() db/config: introduce enable_tombstone_gc_for_streaming_and_repair	2024-07-09 19:04:11 +03:00
Patryk Wrobel	a89e3d10af	code-cleanup: add missing header guards The following command had been executed to get the list of headers that did not contain '#pragma once': 'grep -rnw . -e "#pragma once" --include *.hh -L' This change adds missing include guard to headers that did not contain any guard. Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com> Closes scylladb/scylladb#19626	2024-07-09 18:31:35 +03:00
Piotr Dulikowski	3c535641fd	Merge 'service/storage_proxy: Add metrics keeping track of incoming hints' from Dawid Mędrek Although Scylla already exposes metrics keeping track of various information related to hinted handoff, all of them correspond to either storing or sending hints. However, when debugging, it's also crucial to be aware of how many hints are coming to a given node and what their size is. Unfortunately, the existing metrics are not enough to obtain that information. This PR introduces the following new metrics: * `sent_bytes_total` – the total size of the hints that have been sent from a given shard, * `received_hints_total` – the total number of hints that a given shard has received, * `received_hints_bytes_total` – the total size of the hints a given shard has received. It also renames `hints_manager_sent` to `hints_manager_sent_total` to avoid conflicts of prefixes between that metric and `sent_bytes_total` in tests. Fixes scylladb/scylladb#10987 Closes scylladb/scylladb#18976 * github.com:scylladb/scylladb: db/hints: Add a metric for the size of sent hints service/storage_proxy: Add metrics for received hints	2024-07-08 10:29:53 +02:00
Dawid Medrek	0e1cb0dc73	db/hints: Add logging when ignoring hint directories In `2446cce`, we stopped trying to attempt to create endpoint managers for invalid hint directories even when their names represented IP addresses or host IDs. In this commit, we add logging informing the user about it. Refs scylladb/scylladb#19173 Closes scylladb/scylladb#19618	2024-07-04 20:14:52 +03:00
Avi Kivity	0626e0487d	Merge 'Add copy on write to functions schema code' from Marcin Maliszkiewicz This is the first patch from series which would allow us to unify raft command code. Property we want to achieve is that all modifications performed by a single raft command can be made visible atomically. This helps to exclude accidental dependencies across subsystem updates and make easier to reason about state. Here we alter functions schema code so that changes are first applied to a copy of declared functions and then made visible atomically. Later work will apply similar strategy to the whole schema. Relates scylladb/scylladb#19153 Closes scylladb/scylladb#19598 * github.com:scylladb/scylladb: cql3: functions: make modification functions accessible only via batch class db: replica: batch functions schema modifications cql3: functions: introduce class for batching functions modifications cql3: functions: make functions class non-static cql3: functions: remove reduntant class access specifiers cql3: functions: remove unused java snippet	2024-07-04 17:40:23 +03:00
Marcin Maliszkiewicz	32fe101f9d	db: replica: batch functions schema modifications Before each function change was immediately visible as during event notification logic yielded. Now we first gather the modifications and then commit them. Further work will broaden the scope of atomicity to the whole schema and even across other subsystems.	2024-07-04 13:10:26 +02:00
Nadav Har'El	96dff367f8	Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros This series is another approach of https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets updated - we do not assure that the view update backlog returned in a response is necessarily the backlog that increased due to the corresponding write, the returned backlog may be outdated up to 10ms. Because this series does not include this change, it's considerably less complex and it doesn't modify the common write patch, so no particular performance considerations were needed in that context. The issue being fixed is still the same, the full description can be seen below. When a replica applies a write on a table which has a materialized view it generates view updates. These updates take memory which is tracked by `database::_view_update_concurrency_sem`, separate on each shard. The fraction of units taken from the semaphore to the semaphore limit is the shard's view update backlog. Based on these backlogs, we want to estimate how busy a node is with its view updates work. We do that by taking the max backlog across all shards. To avoid excessive cross-shard operations, the node's (max) backlog isn't calculated each time we need it, but up to 1 time per 10ms (the `_interval`) with an optimization where the backlog of the calculating shard is immediately up-to-date (we don't need cross-shard operations for it): ``` update_backlog node_update_backlog::fetch() { auto now = clock::now(); if (now >= _last_update.load(std::memory_order_relaxed) + _interval) { _last_update.store(now, std::memory_order_relaxed); auto new_max = boost::accumulate( _backlogs, update_backlog::no_backlog(), [] (const update_backlog& lhs, const per_shard_backlog& rhs) { return std::max(lhs, rhs.load()); }); _max.store(new_max, std::memory_order_relaxed); return new_max; } return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed)); } ``` For the same reason, even when we do calculate the new node's backlog, we don't read from the `_view_update_concurrency_sem`. Instead, for each shard we also store a update_backlog atomic which we use for calculation: ``` struct per_shard_backlog { // Multiply by 2 to defeat the prefetcher alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog(); need_publishing need_publishing = need_publishing::no; update_backlog load() const { return backlog.load(std::memory_order_relaxed); } }; std::vector<per_shard_backlog> _backlogs; ``` Due to this distinction, the update_backlog atomic need to be updated separately, when the `_view_update_concurrency_sem` changes. This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`) and then calls node`_update_backlog::add` where the read backlog is stored in the atomic: ``` void storage_proxy::update_view_update_backlog() { _max_view_update_backlog.add(get_db().local().get_view_update_backlog()); } void node_update_backlog::add(update_backlog backlog) { _backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed); _backlogs[this_shard_id()].need_publishing = need_publishing::yes; } ``` For this implementation of calculating the node's view update backlog to work, we need the atomics to be updated correctly when the semaphores of corresponding shards change. The main event where the view update backlog changes is an incoming write request. That's why when handling the request and preparing a response we update the backlog calling `storage_proxy::get_view_update_backlog` (also because we want to read the backlog and send it in the response): backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`) ``` auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable { return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state()) .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] { // make mutation alive until it is processed locally, otherwise it // may disappear if write timeouts before this future is ready got_response(response_id, my_address, get_view_update_backlog()); }); }; backlog update after remote view updates (storage_proxy::remote::handle_write) auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr, shard, response_id, p->get_view_update_backlog())); ``` Now assume that on a certain node we have a write request received on shard A, which updates a row on shard B (A!=B). As a result, shard B will generate view updates and consume units from its `_view_update_concurrency_sem`, but will not update its atomic in `_backlogs` yet. Because both shards in the example are on the same node, shard A will perform a local write calling `lmutate` shown above. In the `lmutate` call, the `apply_locally` will initiate the actual write on shard B and the `storage_proxy::update_view_update_backlog` will be called back on shard A. In no place will the backlog atomic on shard B get updated even though it increased in size due to the view updates generated there. Currently, what we calculate there doesn't really matter - it's only used for the MV flow control delays, so currently, in this scenario, we may only overload a replica causing failed replica writes which will be later retried as hints. However, when we add MV admission control, the calculated backlog will be the difference between an accepted and a rejected request. Fixes: https://github.com/scylladb/scylladb/issues/18542 Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none Closes scylladb/scylladb#19341 * github.com:scylladb/scylladb: test: add test for view backlog not being updated on correct shard test: move auxiliary methods for waiting until a view is built to util mv: update view update backlog when it increases on correct shard	2024-07-04 11:40:09 +03:00
Marcin Maliszkiewicz	16b770ff1a	cql3: functions: make functions class non-static This is done to ease code reuse in the following commit. It'd also help should we ever want properly mount functions class to schema object instead of static storage.	2024-07-04 10:24:57 +02:00
Kefu Chai	cccec07581	db: use format_as() in favor of fmt::streamed() since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also, we've switched to the build image based on fedora 40, which ships fmt-devel v10.2.1, there is no need to use fmt::streamed() when the corresponding format_as() as available. simpler this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19594	2024-07-04 11:10:43 +03:00
Wojciech Mitros	1fdc65279d	test: add test for view backlog not being updated on correct shard This patch adds a test for reproducing issue https://github.com/scylladb/scylladb/issues/18542 The test performs writes on a table with a materialized view and checks that the view backlog increases. To get the current view update backlog, a new metric "view_update_backlog" is added to the `storage_proxy` metrics. The metric differs from the metric from `database` metric with the same name by taking the backlog from the max_view_update_backlog which keeps view update backlogs from all shards which may be a bit outdated, instead of taking the backlog by checking the view_update_semaphore which the backlog is based on directly.	2024-07-03 23:18:52 +02:00
Wojciech Mitros	fd9c7d4d59	mv: update view update backlog when it increases on correct shard When performing a write, we should update the view update backlog on the shard where the mutation is actually applied. Instead, currently we only update it on the shard that initially received the write request (which didn't change at all) and as a result, the backlog on the correct shard and the aggregated max view update backlog are not updated at all. This patch enables updating the backlog on the correct shard. The update is now performed just after the view generation and propagation finishes, so that all backlog increases are noted and the backlog is ready to be used in the write response. Additionally, after this patch, we no longer (falsely) assume that the backlog is modified on the same shard as where we later read it to attach to a response. However, we still compare the aggregated backlog from all shards and the backlog from the shard retrieving the max, as with a shard-aware driver, it's likely the exact shard whose backlog changed.	2024-07-03 23:18:52 +02:00
Michael Litvak	08b29460fc	mv: skip building view updates on a pending replica Currently, a pending replica that applies a write on a table that has materialized views, will build all the view updates as a normal replica, only to realize at a late point, in db::view::get_view_natural_endpoint(), that it doesn't have a paired view replica to send the updates to. It will then either drop the view updates, or send them to a pending view replica, if such exists. This work is unnecessary since it may be dropped, and even if there is a pending view replica to send the updates to, the updates that are built by the pending replica may be wrong since it may have incomplete information. This commit fixes the inefficiency by skipping the view update building step when applying an update on a pending replica. The metric total_view_updates_on_wrong_node is added to count the cases that a view update is determined to be unnecessary. The test reproduces the scenario of writing to a table and applying the update on a pending replica, and verifies that the pending replica doesn't try to build view updates. Fixes scylladb/scylladb#19152 Closes scylladb/scylladb#19488	2024-07-02 13:10:18 +02:00
Nadav Har'El	d61513c41c	Merge 'reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache. This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly. This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread. Fixes: https://github.com/scylladb/scylladb/issues/19017 This fixes a regression introduced in 5.0, so we have to backport to all currently supported releases Closes scylladb/scylladb#19018 * github.com:scylladb/scylladb: test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrenc Please enter the commit message for your changes. Lines starting test/boost/reader_concurrency_semaphore_test: hoist require_can_admit reader_concurrency_semaphore: wire in the configurable cpu concurrency reader_concurrency_semaphore: add cpu_concurrency constructor parameter db/config: introduce reader_concurrency_semahore_cpu_concurrency	2024-07-02 13:39:00 +03:00
Kefu Chai	b71b638b2e	config: add descriptions for default_log_level and friends so that their description can be displayed in `reference/configuration-parameters/` web page. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-07-01 09:47:28 +08:00
Kefu Chai	b486f4ef01	config: define log_to_syslog in a different line before this change, docs/_ext/scylladb_cc_properties.py parses the options line by line, because `log_to_stdout` and `log_to_syslog` are defined in a single line, this script is not able to parse them, hence fails to display them on the `reference/configuration-parameters/` web page. after this change, these two member variables are defined on different lines. both of them can be displayed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-07-01 09:47:28 +08:00
Benny Halevy	b7f00ba4bf	schema_tables: remove dead code Well, even after 10 years, the c++ compilers still do not compile Java... And having that legacy code laying around not only it doesn't help anyone understand what's going on, but on the contrary, it's confusing and distracting. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-06-27 20:34:02 +03:00
Botond Dénes	d5a149fc01	db/config: introduce enable_tombstone_gc_for_streaming_and_repair To control whether the compacting reader (if enabled) for streaming and repair can garbage-collect tombstones. Default is false (previous behaviour). Not wired yet.	2024-06-26 04:05:17 -04:00
Botond Dénes	31c0fa07d8	db/batchlog_manager: bypass cache when scanning batchlog table Scans should not pollute the cache with cold data, in general. In the case of the batchlog table, there is another reason to bypass the cache: this table can have a lot of partition tombstones, which currently are not purged from the cache. So in certain cases, using the cache can make batch replay very slow, because it has to scan past tombstones of already replayed batches.	2024-06-25 06:15:47 -04:00
Botond Dénes	29f610d861	db/batchlog_manager: replace open-coded paging with internal one query_processor has built-in paging support, no need to open-code paging in batchlog manager code.	2024-06-25 06:15:47 -04:00
Botond Dénes	2dd057c96d	db/batchlog_manager: implement cleanup after all batchlog replay We have a commented code snippet from Origin with cleanup and a FIXME to implement it. Origin flushes the memtables and kicks a compaction. We only implement the flush here -- the flush will trigger a compaction check and we leave it up to the compaction manager to decide when a compaction is worthwhile. This method used to be called only from unbootstrap, so a cleanup was not really needed. Now it is also called at the end of repair, if the table is using repair-based tombstone-gc. If the memtable is filled with tombstones, this can add a lot of time to the runtime of each repair. So flush the memtable at the end, so the tombstones can be purged (they aren't purged from memtables yet).	2024-06-25 06:15:47 -04:00
Botond Dénes	c7317be09a	db/config: introduce reader_concurrency_semahore_cpu_concurrency To allow increasing the semaphore's CPU concurrency, which is currently hard-limited to 1. Not wired yet.	2024-06-25 04:00:11 -04:00
Kefu Chai	7b10cc8079	treewide: include seastar headers with brackets this change was created in the same spirit of `ebff5f5d`. despite that we include Seastar as a submodule, Seastar is not a part of scylla project. so we'd better include its headers using brackets. `ebff5f5d` addressed this cosmetic issue a while back. but probably clangd's header-insertion helped some of contributor to insert the missing headers with `"`. so this style of `include` returned to the tree with these new changes. unfortunately, clangd does not allow us to configure the style of `include` at the time of writing. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19406	2024-06-21 19:20:27 +03:00
Dawid Medrek	2446cce272	db/hints: Initialize endpoint managers only for valid hint directories Before these changes, it could happen that Scylla initialized endpoint managers for hint directories representing * host IDs before migrating hinted handoff to using host IDs, * IP addresses after the migration. One scenario looked like this: 1. Start Scylla and upgrade the cluster to using host IDs. 2. Create, by hand, a hint directory representing an IP address. 3. Trigger changing the host filter in hinted handoff; it could be achieved by, for example, restricting the set of data centers Scylla is allowed to save hints for. When changing the host filter, we browse the hint directories and create endpoint managers if we can send hints towards the node corresponding to a given hint directory. We only accepted hint directories representing IP addresses and host IDs. However, we didn't check whether the local node has already been upgraded to host-ID-based hinted handoff or not. As a result, endpoint managers were created for both IP addresses and host IDs, no matter whether we were before or after the migration. These changes make sure that any time we browse the hint directories, we take that into account. Fixes scylladb/scylladb#19172 Closes scylladb/scylladb#19173	2024-06-21 15:59:49 +02:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" To transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibilty, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents was folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Piotr Dulikowski	75441ee120	Merge 'mv: fix value of the gossiped view update backlog' from Wojciech Mitros Currently, when calculating the view update backlog for gossip, we start with `db::view::update_backlog()` and compare it to backlogs from all shards. However, this backlog can't be compared to other backlogs - it has size 0 and we compare the fraction current/size when comparing backlogs, causing us to compare with `NaN`. This patch fixes it by starting the comparisons with an empty backlog. The patch introducing this issue (`f70f774e40`) wasn't backported, so this one doesn't need to be either Closes scylladb/scylladb#19247 * github.com:scylladb/scylladb: mv: make the view update backlog unmofidiable mv: fix value of the gossiped view update backlog	2024-06-20 06:27:11 +02:00
Wojciech Mitros	cde14a5788	mv: make the view update backlog unmofidiable Currently, a view update backlog may reach an invalid state, when its max is 0 and its relative_size() is NaN as a result. This can be achieved either by constructing the backlog with a 0 max or by modifying the max of an existing backlog. In particular, this happens when creating the backlog using the default constructor. In this patch the the default constructor is deleted and a check is added to make sure that the max is different than 0 is added to its constructor - if the check fails, we construct an empty backlog instead, to handle the possibility of getting an invalid backlog sent from a node with a version that's missing this check. Additionally, we make the backlogs members private, exposing them only through const getters.	2024-06-19 19:44:57 +02:00
Wojciech Mitros	1de5566cfa	mv: fix value of the gossiped view update backlog Currently, when calculating the view update backlog for gossip, we start with `db::view::update_backlog()` and compare it to backlogs from all shards. However, this backlog can't be compared to other backlogs - it has size 0 and we compare the fraction current/size when comparing backlogs, causing us to compare with `NaN`. This patch fixes it by starting the comparisons with an empty backlog.	2024-06-18 13:15:18 +02:00
Kefu Chai	68ef7dda79	config: correct the comment on printable_to_json() seastar::format() does not use operator<< under the hood, it uses {fmt}, so update the comment accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19315	2024-06-18 12:08:59 +03:00
Pavel Emelyanov	147552c34a	Merge 'configurable maintenance (streaming) semaphore count resource limit' from Botond Dénes Making the count resources on the maintenance (streaming) semaphore live update via config. This will allow us to improve repair speed on mixed-shard clusters, where we suspect that reader trashing -- due to the combination of high number of readers on each shard and very conservative reader count limit (10) -- is the main cause of the slowness. Making this count limit confgurable allows us to start experimenting with this fix, without committing to a count limit increase (or removal), addressing the pain in the field. Refs: #18269 No OSS backport needed. Closes scylladb/scylladb#19248 * github.com:scylladb/scylladb: replica/database: wire in maintenance_reader_concurrency_semaphore_count_limit db/config: introduce maintenance_reader_concurrency_semaphore_count_limit reader_concurrency_semaphore: make count parameter live-update	2024-06-18 12:02:24 +03:00
Botond Dénes	1acc57e19d	Merge 'schema: Make "describe" use extensions to string' from Calle Wilund Fixes #19334 Current impl uses hardcoded printing of a few extensions. Instead, use extension options to string and print all. Note: required to make enterprise CI happy again. Closes scylladb/scylladb#19337 * github.com:scylladb/scylladb: schema: Make "describe" use extensions to string schema_extensions: Add an option to string method	2024-06-18 11:28:11 +03:00
Nadav Har'El	4faceeaa33	Merge 'treewide: drop thrift support' from Kefu Chai thrift support was deprecated since ScyllaDB 5.2 > Thrift API - legacy ScyllaDB (and Apache Cassandra) API is > deprecated and will be removed in followup release. Thrift has > been disabled by default. so let's drop it. in this change, * thrift protocol support is dropped * all references to thrift support in document are dropped * the "thrift_version" column in system.local table is preserved for backward compatibility, as we could load from an existing system.local table which still contains this clolumn, so we need to write this column as well. * "/storage_service/rpc_server" is only preserved for backward compatibility with java-based nodetool. Fixes #3811 Fixes #18416 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> - [x] not a fix, no need to backport Closes scylladb/scylladb#18453 * github.com:scylladb/scylladb: config: expand on rpc_keepalive's description api: s/rpc/thrift/ db/system_keyspace: drop thrift_version from system.local table transport: do not return client_type from cql_server::connection::make_client_key() treewide: drop thrift support	2024-06-17 22:36:49 +03:00
Kefu Chai	b64126fe1c	db: remove unused operator<< since we've switched almost all callers of the operator<< to {fmt}, let's drop the unused operator<<:s. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19313	2024-06-17 17:33:31 +03:00
Calle Wilund	d27620e146	schema_extensions: Add an option to string method Allow an extension to describe itself as the CQL property string that created it (and is serialized to schema tables) Only paxos extension requires override.	2024-06-17 13:30:10 +00:00
Dawid Medrek	670830091c	db/hints: Use dedicated functions to lock a shared mutex Seastar has functions implementing locking a `seastar::shared_mutex`. We should use those now instead of reimplementing them in Scylla. Closes scylladb/scylladb#19253	2024-06-14 20:31:37 +02:00
Wojciech Mitros	9bae1814ab	test: add test for failed view building write For various reasons, a view building write may fail. When that happens, the view building should not finish until these writes are successfully retried and they should not interfere with any writes that are performed to the base table while the view is building. The test introduced in this patch confirms that this is the case. Refs scylladb/scylladb#19261 Closes scylladb/scylladb#19263	2024-06-14 10:38:21 +02:00
Botond Dénes	665fdd6ce4	db/config: introduce maintenance_reader_concurrency_semaphore_count_limit To control the amount of count resources of the maintenance (streaming) semaphore. Not wired yet.	2024-06-13 01:59:21 -04:00

1 2 3 4 5 ...

3857 Commits